Paper: Modeling Content Identification From Document Images

ACL ID A94-1004
Title Modeling Content Identification From Document Images
Venue Applied Natural Language Processing Conference
Session Main Conference
Year 1994

A new technique to locate content-representing words for a given document image using an abstract representation of character shapes is described. A character shape code representation defined by the location of a character in a text line has been developed. Character shape code generation avoids the computational expense of conventional optical character recognition (OCR). Because character shape codes are an abstraction of standard character codes (e.g., ASCII), the mapping is ambiguous. In this paper, the ambiguity is shown to be practically limited to an acceptable level. It is illustrated that: first, punctuation marks are clearly distinguished from the other characters; second, stop words are generally distinguishable from other words, because the permutations of character sha...
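
To make the idea of an ambiguous shape-code mapping concrete, the following is a minimal sketch, not the paper's actual code set. It assumes a simplified, hypothetical classification of characters by their vertical position in a text line (ascender, x-height, descender, with `i` and `j` kept separate), and it operates on ASCII text rather than on image data. The names `shape_code`, `word_shape`, and `ambiguity_classes` are illustrative only.

```python
from collections import defaultdict

# Hypothetical, simplified letter classes (an assumption, not the paper's table).
ASCENDERS = set("bdfhklt")   # lowercase letters rising above the x-height
DESCENDERS = set("gpqy")     # lowercase letters dropping below the baseline


def shape_code(ch):
    """Map one ASCII character to a coarse shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"           # tall characters
    if ch in DESCENDERS:
        return "g"           # characters with a descender
    if ch in "ij":
        return ch            # dotted characters kept as their own codes
    if ch.islower():
        return "x"           # remaining x-height letters
    return ch                # punctuation and spaces stay distinct


def word_shape(word):
    """Concatenate the shape codes of every character in a word."""
    return "".join(shape_code(c) for c in word)


def ambiguity_classes(words):
    """Group words that collapse to the same shape-code string."""
    classes = defaultdict(set)
    for w in words:
        classes[word_shape(w)].add(w)
    return dict(classes)
```

Under these assumptions, `word_shape("the")` yields `"AAx"`, while `"cat"` and `"eat"` both map to `"xxA"`, which illustrates the kind of ambiguity the abstract refers to. Punctuation passes through unchanged, so it remains distinct from letters in this sketch.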