Combining Trigram and Winnow in Thai OCR Error Correction

ACL ID P98-2138
Title Combining Trigram and Winnow in Thai OCR Error Correction
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 1998

For languages that have no explicit word boundaries, such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating error words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an error word and trying to correct all of them. In this paper, we propose the idea of reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. Boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate the candidate correction words, we used a modified edit distance which reflects the characteristics of Thai OCR errors. Finally, a par...
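
The abstract does not spell out how the edit distance is modified. As a minimal sketch of one plausible reading, the Python below lowers the substitution cost for character pairs that an OCR engine might confuse, so visually similar candidates rank closer to the observed string. The `CONFUSION_COST` table, its values, and the function names are hypothetical illustrations, not taken from the paper, and the paper's actual modification may differ.

```python
from typing import Dict, Tuple

# Hypothetical confusion costs for visually similar Thai characters.
# The pairs and values are illustrative assumptions, not data from the paper.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("ค", "ด"): 0.3,
    ("ถ", "ภ"): 0.3,
    ("บ", "ป"): 0.4,
}


def substitution_cost(a: str, b: str) -> float:
    """Return 0 for identical characters, a reduced cost for confusable
    pairs (looked up in either order), and 1 otherwise."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))


def weighted_edit_distance(source: str, target: str) -> float:
    """Levenshtein distance with unit insert/delete costs and
    confusion-aware substitution costs."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every character of the source prefix
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1]
                + substitution_cost(source[i - 1], target[j - 1]),
            )
    return dist[rows - 1][cols - 1]
```

Under these assumed costs, a candidate that differs from the OCR output only by a confusable pair scores below 1, while an unrelated substitution scores a full 1, so a candidate generator could keep only dictionary words whose distance falls under a chosen threshold.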