Paper: Unsupervised Transcription of Historical Documents

ACL ID P13-1021
Title Unsupervised Transcription of Historical Documents
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2013

We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google's open source OCR system.
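
The relative reductions quoted in the abstract are ratios of word error rates (WER). The following is a minimal sketch of that arithmetic, assuming WER is the word-level edit distance divided by the number of reference words; this is a common definition, and the abstract does not spell out the exact scoring used in the paper. The function names and sample strings are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (assumed definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction by which system_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative strings only; not data from the paper.
    reference = "the quick brown fox"
    baseline_output = "the quiek brown"
    system_output = "the quick brown"
    baseline = word_error_rate(reference, baseline_output)
    system = word_error_rate(reference, system_output)
    print(f"baseline WER: {baseline:.2f}")
    print(f"system WER: {system:.2f}")
    print(f"relative reduction: {relative_reduction(baseline, system):.0%}")
```

Under this definition, a 31% relative reduction means the system's WER is 0.69 times the baseline's WER, whatever the absolute values are.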