Paper: Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling

ACL ID P09-1012
Title Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2009
Authors Daichi Mochihashi, Takeshi Yamada, Naonori Ueda

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, in which a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be viewed as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any word indications.
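
For intuition, below is a minimal Python sketch of the dynamic-programming step that such a blocked Gibbs sampler relies on: forward filtering over all candidate word boundaries of a sentence, followed by backward sampling of a single segmentation. It assumes a simple unigram word model with a toy character-level spelling prior (toy_word_logprob) instead of the paper's nested hierarchical Pitman-Yor model; the function names and the max_word_len cutoff are illustrative assumptions, not the authors' implementation.

import math
import random


def logaddexp(a, b):
    # Numerically stable log(exp(a) + exp(b)).
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def toy_word_logprob(w, n_char_types=30, p_end=0.3):
    # Toy spelling model (an assumption for illustration): characters are
    # uniform over n_char_types and word length is geometric with stopping
    # probability p_end.  The paper instead embeds a character-level
    # Pitman-Yor language model here.
    return (len(w) * math.log(1.0 / n_char_types)
            + (len(w) - 1) * math.log(1.0 - p_end)
            + math.log(p_end))


def sample_segmentation(chars, word_logprob, max_word_len=8):
    # Forward filtering: alpha[t] is the log-probability of the character
    # prefix chars[:t], summed over all segmentations of that prefix.
    n = len(chars)
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0
    for t in range(1, n + 1):
        for k in range(1, min(max_word_len, t) + 1):
            alpha[t] = logaddexp(alpha[t],
                                 alpha[t - k] + word_logprob(chars[t - k:t]))

    # Backward sampling: walking from the end of the sentence, draw the
    # length of the final word in proportion to its contribution to alpha,
    # then recurse on the remaining prefix.  Repeating this per sentence
    # inside a Gibbs sweep resamples a whole segmentation as one block.
    words = []
    t = n
    while t > 0:
        lengths = list(range(1, min(max_word_len, t) + 1))
        logps = [alpha[t - k] + word_logprob(chars[t - k:t]) for k in lengths]
        m = max(logps)
        weights = [math.exp(lp - m) for lp in logps]
        r = random.uniform(0.0, sum(weights))
        acc = 0.0
        for i, (k, w) in enumerate(zip(lengths, weights)):
            acc += w
            if r <= acc or i == len(lengths) - 1:
                words.append(chars[t - k:t])
                t -= k
                break
    words.reverse()
    return words


if __name__ == "__main__":
    # With this toy prior the draw is close to random, but the control flow
    # mirrors the forward-filtering / backward-sampling step of the sampler.
    print(sample_segmentation("unsupervisedsegmentation", toy_word_logprob))

In the full model, word_logprob would be replaced by the predictive probability of the hierarchical Pitman-Yor word model, which backs off to the embedded character-level spelling model for unseen words.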