Paper: An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

ACL ID D10-1081
Title An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2010
Authors

This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the "CHILDES" corpus for research in language development reveals that the algorithm achieves an accuracy comparable to that of state-of-the-art methods in unsupervised word segmentation, in significantly reduced computational time.
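
The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract gives no formulas, so the fixed context length (`order`), the thresholding rule, the unigram coding scheme, and every function name below are assumptions chosen for illustration. Branching entropy is taken here as the entropy of the next character given the preceding context, and the two-part code length as the bits needed to spell out the lexicon plus the bits needed to encode the corpus as a sequence of lexicon entries.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, order=2):
    """Entropy in bits of the next character after each `order`-character context."""
    followers = defaultdict(Counter)
    for i in range(len(text) - order):
        followers[text[i:i + order]][text[i + order]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropy


def segment_by_threshold(text, entropy, order=2, threshold=0.5):
    """Cut before position i when the context ending at i is hard to continue.

    Assumption: a boundary is hypothesized wherever the branching entropy of
    the preceding `order` characters exceeds `threshold`.
    """
    words, start = [], 0
    for i in range(order, len(text)):
        if entropy.get(text[i - order:i], 0.0) > threshold:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


def description_length(words):
    """Two-part code length in bits: lexicon cost plus corpus cost.

    Assumption: both parts use a simple unigram code (characters for the
    lexicon, word types for the corpus); separator costs are ignored.
    """
    token_counts = Counter(words)
    total_tokens = sum(token_counts.values())
    corpus_bits = -sum(
        n * math.log2(n / total_tokens) for n in token_counts.values()
    )
    char_counts = Counter(ch for word in token_counts for ch in word)
    total_chars = sum(char_counts.values())
    lexicon_bits = -sum(
        n * math.log2(n / total_chars) for n in char_counts.values()
    )
    return lexicon_bits + corpus_bits


def pick_segmentation(text, thresholds, order=2):
    """Among threshold-generated candidates, keep the one with the shortest code."""
    entropy = branching_entropy(text, order)
    candidates = [
        segment_by_threshold(text, entropy, order, t) for t in thresholds
    ]
    return min(candidates, key=description_length)
```

In this sketch the entropy threshold restricts the search to a handful of candidate segmentations, and the description length selects among them, which mirrors the division of labor the abstract describes. A call such as `pick_segmentation(corpus_text, [0.25, 0.5, 1.0, 2.0])` (with `corpus_text` a hypothetical unsegmented string) returns a list of word strings whose concatenation is the input.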