Paper: A Joint Model for Unsupervised Chinese Word Segmentation

ACL ID D14-1092
Title A Joint Model for Unsupervised Chinese Word Segmentation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014

In this paper, we propose a joint model for unsupervised Chinese word segmentation (CWS). Inspired by the "products of experts" idea, our joint model first combines two generative models, a word-based hierarchical Dirichlet process model and a character-based hidden Markov model, by simply multiplying their probabilities together. Gibbs sampling is used for model inference. To further incorporate the strengths of goodness-based models, we then integrate nVBE into our joint model by using it to initialize the Gibbs sampler. We conduct our experiments on the PKU and MSRA datasets provided by the second SIGHAN bakeoff. Test results on these two datasets show that the joint model achieves much better results than all of its component models. Statistical significance tes...
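
The combination step described in the abstract, multiplying the probabilities of two models, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `combine_experts` and `sample_boundary`, the two-outcome framing (boundary vs. no boundary at one character position), and the example numbers are all assumptions made for illustration. The sketch shows only how two per-outcome distributions might be multiplied, renormalized, and sampled from within a single Gibbs step.

```python
import random


def combine_experts(p_word, p_char):
    """Multiply per-outcome probabilities from two models and renormalize.

    p_word and p_char are hypothetical stand-ins for the probabilities that a
    word-based model and a character-based model assign to the same outcomes.
    """
    unnorm = [a * b for a, b in zip(p_word, p_char)]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("the two models assign zero joint mass to every outcome")
    return [u / total for u in unnorm]


def sample_boundary(p_word, p_char, rng=random):
    """Draw 1 (boundary) or 0 (no boundary) from the combined distribution."""
    _p_no, p_yes = combine_experts(p_word, p_char)
    return 1 if rng.random() < p_yes else 0


# Illustrative numbers only: [P(no boundary), P(boundary)] under each model.
combined = combine_experts([0.4, 0.6], [0.2, 0.8])
```

In this toy case both models prefer a boundary, so the product sharpens that preference relative to either model alone. A full sampler would recompute both models' probabilities from the current segmentation at every position, which this sketch leaves out.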