Paper: Improved Typesetting Models for Historical OCR

ACL ID P14-2020
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014

We present richer typesetting models that extend the unsupervised historical document recognition system of Berg-Kirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to accurately track italic and non-italic portions of documents. Richer models complicate inference so we present a new, streamlined procedure that is over 25x faster than the method used by Berg-Kirkpatrick et al. (2013). Our final system achieves a relative word error reduction of 22% compared to state-of-the-art results on a dataset of historical newspapers.
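
To make the first idea concrete, the sketch below contrasts an independent model of per-glyph vertical offsets with a first-order Markov model in which each offset depends on its left neighbor. It is only an illustration under stated assumptions, not the parameterization used in the paper: the offset set `OFFSETS`, the `stay` probability, and both function names are hypothetical.

```python
import math

# Hypothetical discrete vertical offsets (in pixels); not taken from the paper.
OFFSETS = [-1, 0, 1]


def independent_logprob(seq, prior):
    """Log-probability when every glyph's offset is drawn independently."""
    return sum(math.log(prior[o]) for o in seq)


def markov_logprob(seq, prior, stay):
    """Log-probability under a first-order chain over neighboring offsets.

    Assumed transition rule: keep the previous glyph's offset with
    probability `stay`, otherwise move uniformly to one of the others.
    """
    move = (1.0 - stay) / (len(OFFSETS) - 1)
    total = math.log(prior[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(stay if cur == prev else move)
    return total


uniform = {o: 1.0 / len(OFFSETS) for o in OFFSETS}
smooth = [0, 0, 0, 1, 1, 1]     # baseline drifts once, as on a slightly skewed line
jittery = [0, 1, -1, 0, 1, -1]  # offsets jump at every glyph

# The independent model cannot tell the two sequences apart ...
print(independent_logprob(smooth, uniform), independent_logprob(jittery, uniform))
# ... while the chain model prefers the smoothly varying baseline.
print(markov_logprob(smooth, uniform, stay=0.9), markov_logprob(jittery, uniform, stay=0.9))
```

Under the uniform prior the independent model scores both sequences identically, whereas the chain model assigns a higher score to the sequence whose baseline changes only once. This is the kind of preference that coupling neighboring offsets makes expressible.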