Paper: Improved Typesetting Models for Historical OCR

ACL ID P14-2020
Title Improved Typesetting Models for Historical OCR
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014

We present richer typesetting models that extend the unsupervised historical document recognition system of Berg-Kirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to accurately track italic and non-italic portions of documents. Richer models complicate inference so we present a new, streamlined procedure that is over 25x faster than the method used by Berg-Kirkpatrick et al. (2013). Our final system achieves a relative word error reduction of 22% compared to state-of-the-art results on a dataset of historical newspapers.
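
The headline figure is a relative, not absolute, reduction in word error rate. As a minimal sketch of how such a figure is usually computed, the snippet below uses hypothetical error rates chosen only so the arithmetic lands on 22%; the abstract does not report the underlying rates, and the function name is an illustrative assumption.

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction of the baseline word error removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical word error rates, NOT taken from the paper.
baseline_wer = 0.50
improved_wer = 0.39

reduction = relative_error_reduction(baseline_wer, improved_wer)
print(f"relative WER reduction: {reduction:.0%}")
```

Under these assumed rates the absolute drop is 11 points of word error, while the relative drop is 22% of the baseline error.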