Paper: Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

ACL ID P14-2122
Title Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014
Authors

Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches that explicitly model the probabilities of words through Dirichlet process (DP) models or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have only been carried out on small corpora such as the Basic Travel Expression Corpus (BTEC) due to their computational complexity. This paper proposes an efficient unified PYP-based monolingual and bilingual UWS method. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLE...