Paper: Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints

ACL ID P14-1128
Title Toward Better Chinese Word Segmentation for SMT via Bilingual Constraints
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014
Authors

This study investigates building a better Chinese word segmentation model for statistical machine translation. It aims to leverage word boundary information, automatically learned by bilingual character-based alignments, to induce a preferable segmentation model. We propose treating the induced word boundaries as soft constraints that bias the continuous learning of a supervised CRFs model, trained on the treebank data (labeled), on the bilingual data (unlabeled). The induced word boundary information is encoded as a graph propagation constraint. The constrained model induction is accomplished using the posterior regularization algorithm. Experiments on a Chinese-to-English machine translation task reveal that the proposed model can bring positive segmentation eff...