Paper: Improving Statistical Machine Translation Performance by Training Data Selection and Optimization

ACL ID D07-1036
Title Improving Statistical Machine Translation Performance by Training Data Selection and Optimization
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2007
Authors

A parallel corpus is an indispensable resource for translation model training in statistical machine translation (SMT). Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of the existing parallel corpora. Two kinds of methods are proposed: offline data optimization and online model optimization. The offline method adapts the training data by redistributing the weight of each training sentence pair. The online method adapts the translation model by redistributing the weight of each predefined submodel. An information retrieval model is used for the weighting scheme in both methods. Experimental results show that without using any additional resource, both methods can improve SMT performance significantl...
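The offline weighting idea can be illustrated with a minimal sketch. The abstract does not specify the retrieval model or how retrieval results become weights, so everything concrete here is an assumption made for illustration: TF-IDF vectors with cosine similarity stand in for the information retrieval model, a training pair's weight is taken to be the number of test sentences that retrieve it among their top-N matches, and the names `tfidf_vectors`, `cosine`, `offline_weights`, and `top_n` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF vectors (term -> weight) for tokenized docs, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption; the abstract does not give a formula).
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vecs = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def offline_weights(train_src, test_src, top_n=2):
    """Assumed scheme: weight = how many test sentences retrieve the pair in their top-N."""
    train_tok = [s.lower().split() for s in train_src]
    vecs, idf = tfidf_vectors(train_tok)
    weights = [0] * len(train_src)
    for query in test_src:
        counts = Counter(query.lower().split())
        # Terms unseen in the training data get zero weight.
        qvec = {t: tf * idf.get(t, 0.0) for t, tf in counts.items()}
        scores = [cosine(qvec, v) for v in vecs]
        ranked = sorted(range(len(vecs)), key=lambda i: scores[i], reverse=True)
        for i in ranked[:top_n]:
            if scores[i] > 0.0:  # ignore training sentences with no overlap
                weights[i] += 1
    return weights


if __name__ == "__main__":
    train = [
        "the cat sat on the mat",
        "stock markets fell sharply",
        "the dog sat on the rug",
    ]
    test = ["a cat on a mat"]
    print(offline_weights(train, test, top_n=2))
```

In this sketch only the source side of each training pair is scored; the resulting integer weights could then be used, for example, to duplicate or drop sentence pairs before translation model training. Whether the paper's method works this way is not stated in the abstract.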