Paper: Effective Selection of Translation Model Training Data

ACL ID P14-2093
Title Effective Selection of Translation Model Training Data
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014

Data selection has been demonstrated to be an effective approach to addressing the lack of high-quality bitext for statistical machine translation in the domain of interest. Most current data selection methods rely solely on language models trained on small-scale in-domain data to select domain-relevant sentence pairs from a general-domain parallel corpus. By contrast, we argue that the relevance between a sentence pair and the target domain can be better evaluated by a combination of language model and translation model. In this paper, we study and experiment with novel methods that apply translation models to domain-relevant data selection. The results show that our methods outperform previous methods. When the selected sentence pairs are evaluated on an end-to-end MT ...
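
To make the language-model-only baseline mentioned in the abstract concrete, the following is a minimal sketch under simplifying assumptions, not the paper's proposed method. It assumes add-one-smoothed unigram language models trained on each side of a small in-domain bitext, and ranks general-domain sentence pairs by the sum of their per-token cross-entropies. The function names (`train_unigram_lm`, `cross_entropy`, `select_pairs`) and the scoring rule are hypothetical choices made for illustration. Real systems would typically use higher-order n-gram models.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Return a function giving add-one-smoothed unigram log-probabilities.

    Assumption: whitespace tokenization and a unigram model, chosen only
    to keep the sketch short.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def logprob(token):
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def cross_entropy(logprob, sentence):
    """Per-token cross-entropy of a sentence; lower means closer to the domain."""
    tokens = sentence.split()
    if not tokens:
        return float("inf")  # empty sentences are ranked last
    return -sum(logprob(t) for t in tokens) / len(tokens)


def select_pairs(in_domain_pairs, general_pairs, k):
    """Keep the k general-domain pairs that look most in-domain.

    Hypothetical scoring rule: source-side plus target-side cross-entropy
    under language models trained on the in-domain bitext.
    """
    src_lm = train_unigram_lm(s for s, _ in in_domain_pairs)
    tgt_lm = train_unigram_lm(t for _, t in in_domain_pairs)

    def score(pair):
        return cross_entropy(src_lm, pair[0]) + cross_entropy(tgt_lm, pair[1])

    return sorted(general_pairs, key=score)[:k]
```

A call such as `select_pairs(in_domain, general, k=1000)` would return the 1000 lowest-scoring pairs. Scoring of this kind looks at each side of the pair in isolation, which is the limitation the abstract points to when it argues for bringing a translation model into the relevance estimate.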