Paper: A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation

ACL ID: I05-1047
Title: A Chunking Strategy Towards Unknown Word Detection in Chinese Word Segmentation
Venue: International Joint Conference on Natural Language Processing
Session: Main Conference
Year: 2005
Authors

This paper proposes a chunking strategy to detect unknown words in Chinese word segmentation. First, a raw sentence is pre-segmented into a sequence of word atoms using a maximum matching algorithm. Then a chunking model is applied to detect unknown words by chunking one or more word atoms together according to the word formation patterns of the word atoms. In this paper, a discriminative Markov model, named Mutual Information Independence Model (MIIM), is adopted in chunking. In addition, a maximum entropy model is applied to integrate various types of contexts and resolve the data sparseness problem in MIIM. Moreover, an error-driven learning approach is proposed to learn useful contexts in the maximum entropy model. In this way, the number of contexts in the maximum entropy mo...
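
The two-stage pipeline described in the abstract (pre-segmentation into word atoms, then chunking atoms into words) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the forward variant of maximum matching is used, the names `max_match` and `merge_chunks`, the `max_len` parameter and the toy lexicon are hypothetical, and the chunking decisions are supplied by hand as "B"/"I" tags instead of being predicted by MIIM or a maximum entropy model.

```python
def max_match(sentence, lexicon, max_len=4):
    """Forward maximum matching (assumed variant): at each position,
    take the longest lexicon entry; fall back to a single character."""
    atoms = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a lexicon hit.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in lexicon or j == i + 1:
                atoms.append(sentence[i:j])
                i = j
                break
    return atoms


def merge_chunks(atoms, tags):
    """Join word atoms into words using hypothetical chunk tags:
    "B" starts a new word, "I" attaches the atom to the previous word."""
    words = []
    for atom, tag in zip(atoms, tags):
        if tag == "I" and words:
            words[-1] += atom
        else:
            words.append(atom)
    return words


# Toy lexicon (illustrative only, not from the paper).
lexicon = {"中国", "人民", "中国人"}
atoms = max_match("中国人民", lexicon)
# Hand-written tags standing in for the output of a chunking model.
words = merge_chunks(atoms, ["B", "I"])
```

In this sketch the greedy pre-segmentation yields atoms that do not line up with the intended words, and the chunking step is what merges adjacent atoms into a single word. In the paper this merging decision is made by the chunking model, not by fixed tags.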