Paper: Mostly-Unsupervised Statistical Segmentation Of Japanese: Applications To Kanji

ACL ID A00-2032
Title Mostly-Unsupervised Statistical Segmentation Of Japanese: Applications To Kanji
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2000
Authors

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing unsegmented training data, with performance on kanji sequences comparable to and sometimes surpassing that of morphological analyzers over a variety of error metrics.