ACL Anthology Network (beta) |
ACL ID | C96-2138 |
---|---|
Title | Content-Oriented Categorization Of Document Images |
Venue | International Conference on Computational Linguistics |
Session | Main Conference |
Year | 1996 |
Authors |
|
We have developed a technique that categorizes document images based on their content. Unlike conventional methods that use optical character recognition (OCR), we convert document images into word shape tokens, a shape-based representation of words. Because we only need to recognize simple graphical features in the image, this process is much faster than OCR. Although the mapping between word shape tokens and words is one-to-many, the tokens are a rich source of information for content characterization. Using a vector space classifier with a scanned document image database, we show that the word shape token-based approach achieves accuracy adequate for content-oriented categorization when compared with conventional OCR-based approaches.
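
The idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the shape classes or the classifier details, so the character classes below (ascender, descender, x-height, dotted `i`/`j`), the function names, and the cosine-similarity centroid classifier are all assumptions made for illustration. The sketch also derives tokens from plain text as a stand-in for shape features that the real system would extract from the image.

```python
import math
import re
from collections import Counter

# Hypothetical shape classes (an assumption, not taken from the paper).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_token(word):
    """Map a word to a coarse shape token, one symbol per character."""
    out = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            out.append("A")  # tall character
        elif ch in DESCENDERS:
            out.append("g")  # character extending below the baseline
        elif ch in "ij":
            out.append(ch)   # dotted characters kept distinct
        else:
            out.append("x")  # x-height character
    return "".join(out)


def token_vector(text):
    """Count word shape tokens in a text (a sparse term-frequency vector)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return Counter(shape_token(w) for w in words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_texts):
    """Sum the token vectors of each category into one centroid per label."""
    centroids = {}
    for label, text in labeled_texts:
        centroids.setdefault(label, Counter()).update(token_vector(text))
    return centroids


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = token_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

Under these assumed classes, `shape_token("dog")` and `shape_token("bag")` both yield `Axg`, which shows why a single token can correspond to many words. The classifier still works because the distribution of tokens across a document differs between categories.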