Paper: Content Characterization Using Word Shape Tokens

ACL ID C94-2108
Title Content Characterization Using Word Shape Tokens
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1994

By quickly classifying character images into character shape categories, it is possible to automatically extract syntactic information from the text of document images without optical character recognition. Using word shape tokens composed of these character shape codes, a properly trained text tagger can extract part-of-speech information from scanned document images. Later components of a document processing system can then use this information to locate topics, characterize document style, and assist in information retrieval. We can extract noun phrases and other content characteristics using only word shape tokens that have been tagged with their parts of speech. Using this approach, we can process document images quickly to determine whether OCR is warranted, for example, when a text is a lik...
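
The idea of a word shape token can be illustrated with a minimal sketch. The abstract does not list the shape categories, so the classes below (ascender "A", x-height "x", descender "g", plus "i" and "j") are assumptions chosen for illustration, and the function names are hypothetical. The sketch also works on plain text characters, whereas the system described above classifies character images.

```python
# Illustrative only: these shape classes are an assumption, not the
# category inventory used in the paper.
ASCENDERS = set("bdfhklt")   # lowercase letters rising above x-height
DESCENDERS = set("gpqy")     # lowercase letters dropping below the baseline


def char_shape_code(ch: str) -> str:
    """Map one character to a coarse (hypothetical) shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch in ("i", "j"):
        return ch            # dotted letters keep their own codes
    if ch.islower():
        return "x"           # remaining lowercase letters are x-height
    return ch                # punctuation and other symbols pass through


def word_shape_token(word: str) -> str:
    """Build a word shape token by concatenating character shape codes."""
    return "".join(char_shape_code(c) for c in word)


def shape_tokens(text: str) -> list[str]:
    """Convert whitespace-separated words into word shape tokens."""
    return [word_shape_token(w) for w in text.split()]


print(shape_tokens("The dog runs quickly"))
```

Under this assumed mapping, "The" and "the" produce the same token, which shows why a tagger working on shape tokens has to cope with more ambiguity than one working on recognized characters.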