Paper: N-Th Order Ergodic Multigram HMM For Modeling Of Languages Without Marked Word Boundaries

ACL ID C96-1036
Title N-Th Order Ergodic Multigram HMM For Modeling Of Languages Without Marked Word Boundaries
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1996
Authors

Ergodic HMMs have been successfully used for modeling sentence production. However, for some oriental languages such as Chinese, a word can consist of multiple characters without word boundary markers between adjacent words in a sentence. This makes word segmentation on the training and testing data necessary before an ergodic HMM can be applied as the language model. This paper introduces the N-th order Ergodic Multigram HMM for language modeling of such languages. Each state of the HMM can generate a variable number of characters corresponding to one word. The model can be trained without a word-segmented and tagged corpus, and both segmentation and tagging are trained in one single model. Results on its application on a Chinese corpus are reported.

1 Motivation

Statistical language...