Paper: Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model

ACL ID C98-2147
Title Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1998
Authors
  • Masaaki Nagata (NTT Information and Communication Systems Laboratories, Yokosuka, Japan)

We present a novel OCR error correction method for languages without word delimiters that have a large character set, such as Japanese and Chinese. It consists of a statistical OCR model, an approximate word matching method using character shape similarity, and a word segmentation algorithm using a statistical language model. By using a statistical OCR model and character shape similarity, the proposed error corrector outperforms the previously published method. When the baseline character recognition accuracy is 90%, it achieves 97.4% character recognition accuracy.
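
The way the three components named in the abstract could fit together can be illustrated with a small noisy-channel sketch. This is not the paper's implementation: the lexicon, the word probabilities, the set of similar-looking characters, and the probability constants are all hypothetical toy values, Latin letters stand in for Japanese characters, a unigram word model stands in for the statistical language model, and only substitution errors are handled (no insertions or deletions). The function names `char_prob` and `correct` are likewise illustrative.

```python
import math

# Hypothetical unigram "language model": word -> probability (toy values).
WORD_PROB = {"ab": 0.4, "cd": 0.3, "abc": 0.2, "d": 0.1}

# Hypothetical pairs of characters assumed to look alike to the OCR engine.
SIMILAR = {frozenset(("o", "a")), frozenset(("e", "c"))}


def char_prob(observed, intended):
    """Toy OCR model P(observed | intended) with assumed constants."""
    if observed == intended:
        return 0.9
    if frozenset((observed, intended)) in SIMILAR:
        return 0.09   # shape-similar confusion is relatively likely
    return 0.001      # floor for any other substitution


def correct(observed):
    """Jointly segment and correct `observed` by dynamic programming.

    best[i] holds the best log score of any word sequence that explains
    the first i observed characters; back[i] remembers the last word used.
    """
    n = len(observed)
    best = [-math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for word, p_word in WORD_PROB.items():
            start = end - len(word)
            if start < 0 or best[start] == -math.inf:
                continue
            # Language model score plus per-character OCR model score.
            score = best[start] + math.log(p_word)
            for obs_ch, true_ch in zip(observed[start:end], word):
                score += math.log(char_prob(obs_ch, true_ch))
            if score > best[end]:
                best[end] = score
                back[end] = (start, word)
    if best[n] == -math.inf:
        return None  # no sequence of lexicon words covers the input
    words = []
    pos = n
    while pos > 0:
        pos, word = back[pos]
        words.append(word)
    return words[::-1]
```

With these toy numbers, `correct("obed")` returns `["ab", "cd"]`: the misread characters "o" and "e" are mapped back to the shape-similar "a" and "c", and the segmentation "ab" + "cd" scores higher than the competing "abc" + "d". Matching candidate words by shape similarity rather than exact spelling is what allows lexicon words to be proposed even when every character in a span is suspect.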