Paper: Language Determination: Natural Language Processing From Scanned Document Images

ACL ID A94-1003
Title Language Determination: Natural Language Processing From Scanned Document Images
Venue Applied Natural Language Processing Conference
Session Main Conference
Year 1994
Authors

Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.
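
To make the idea of character shape codes and word shape tokens concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's method: the abstract does not list the actual code set, so the classes used here (`A` for tall characters, `x` for x-height characters, `g` for descenders, `i` and `j` for dotted characters) and the function names `char_shape_code`, `word_shape_token`, and `shape_tokens` are hypothetical. The sketch also starts from character-coded text purely for demonstration, whereas the approach described above derives the codes from the document image without recognizing characters.

```python
# Hypothetical, simplified shape classes chosen for illustration only.
# They are NOT taken from the paper, which derives codes from image features.
ASCENDERS = set("bdfhklt")      # lowercase letters rising above x-height
DESCENDERS = set("gpqy")        # lowercase letters dropping below the baseline
DOTTED = set("i")               # x-height letter with a separate dot
DOTTED_DESCENDERS = set("j")    # descender with a separate dot


def char_shape_code(ch: str) -> str:
    """Map one character to a coarse shape code (assumed classes)."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DOTTED:
        return "i"
    if ch in DOTTED_DESCENDERS:
        return "j"
    if ch in DESCENDERS:
        return "g"
    if ch.isalpha():
        return "x"
    return ch  # punctuation and other symbols are passed through unchanged


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word's characters into one token."""
    return "".join(char_shape_code(ch) for ch in word)


def shape_tokens(text: str) -> list[str]:
    """Split text on whitespace and convert each word to a shape token."""
    return [word_shape_token(word) for word in text.split()]
```

Under these assumed classes, `word_shape_token("the")` yields `"AAx"` and `word_shape_token("and")` yields `"xxA"`. Many distinct words collapse to the same token, which is why such a representation is cheap to compute; the abstract's claim is that enough information survives to tell languages apart.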