Paper: Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

ACL ID D10-1077
Title Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2010
Authors

Almost all Chinese language processing tasks involve word segmentation of the language input as their first step, so robust and reliable segmentation techniques are always required to ensure that those tasks perform well. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) have often been used to segment Chinese texts. Compared with traditional lexicon-driven models, machine-learned models achieve higher F-measure scores, but they depend heavily on their training materials. Although they can effectively process texts from the same domain as the training texts, they perform relatively poorly on texts from new domains. In this paper, we propose to use χ² statistics when training an ...