Paper: Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation

ACL ID D14-1173
Title Refining Word Segmentation Using a Manually Aligned Corpus for Statistical Machine Translation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

Languages that have no explicit word delimiters often have to be segmented for statistical machine translation (SMT). This is commonly performed by automated segmenters trained on manually annotated corpora. However, the word segmentation (WS) schemes of these annotated corpora are handcrafted for general usage, and may not be suitable for SMT. An analysis was performed to test this hypothesis using a manually annotated word alignment (WA) corpus for Chinese-English SMT. The analysis revealed that 74.60% of the sentences in the WA corpus, if segmented using an automated segmenter trained on the Penn Chinese Treebank (CTB), will contain conflicts with the gold WA annotations. We formulated an approach based on word splitting with reference to the annotated WA to alleviate these con- ...
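
To make the idea of a WS/WA conflict concrete, the following is a minimal sketch and not the authors' implementation. It assumes, for illustration only, that the gold alignment is available per Chinese character (a mapping from character index to the set of aligned English token indices) and that a segmented word "conflicts" when adjacent characters inside it link to different English tokens. The names `CharAlignment`, `split_word`, `refine`, `has_conflict`, and `conflict_rate` are hypothetical, and the paper's actual conflict definition and splitting rules may differ.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed representation (not from the paper): Chinese character index
# -> set of English token indices it is aligned to in the gold WA.
CharAlignment = Dict[int, FrozenSet[int]]


def split_word(word: str, start: int, align: CharAlignment) -> List[str]:
    """Split `word` wherever adjacent characters link to different English tokens.

    `start` is the index of the word's first character in the sentence.
    """
    if not word:
        return []
    pieces = [word[0]]
    for offset in range(1, len(word)):
        prev_links = align.get(start + offset - 1, frozenset())
        cur_links = align.get(start + offset, frozenset())
        if cur_links == prev_links:
            pieces[-1] += word[offset]   # same links: extend the current piece
        else:
            pieces.append(word[offset])  # links change: start a new piece
    return pieces


def refine(words: List[str], align: CharAlignment) -> List[str]:
    """Apply alignment-guided splitting to every word of a segmented sentence."""
    refined: List[str] = []
    start = 0
    for word in words:
        refined.extend(split_word(word, start, align))
        start += len(word)
    return refined


def has_conflict(words: List[str], align: CharAlignment) -> bool:
    """A sentence conflicts (under this sketch's definition) if any word was split."""
    return len(refine(words, align)) > len(words)


def conflict_rate(corpus: List[Tuple[List[str], CharAlignment]]) -> float:
    """Fraction of sentences that contain at least one conflict."""
    if not corpus:
        return 0.0
    return sum(has_conflict(w, a) for w, a in corpus) / len(corpus)


# Toy example (invented data): "中国人 很 友好" vs. "Chinese people are very friendly"
words = ["中国人", "很", "友好"]
align: CharAlignment = {
    0: frozenset({0}), 1: frozenset({0}),  # 中国 -> Chinese
    2: frozenset({1}),                     # 人   -> people
    3: frozenset({3}),                     # 很   -> very
    4: frozenset({4}), 5: frozenset({4}),  # 友好 -> friendly
}
print(refine(words, align))
```

In this toy sentence the segmenter's word 中国人 spans two English tokens, so the sketch splits it into 中国 and 人 and counts the sentence as conflicting. A corpus-level figure such as the 74.60% reported above would correspond to `conflict_rate` computed over real segmenter output and gold alignments, under whatever conflict definition the paper actually uses.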