Paper: Statistical Machine Translation With Word- And Sentence-Aligned Parallel Corpora

ACL ID P04-1023
Title Statistical Machine Translation With Word- And Sentence-Aligned Parallel Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2004
Authors

The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate and increases the Bleu score when compared to training the same models only on sentence-aligned data. On the Verbmobil data set, we attain a 38% reduction in the alignment error rate and a higher Bleu score with half as many training examples. We discuss how varying the ratio of word-aligned to sentence-aligned data affects the expected performance gain.
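The abstract does not spell out how word-level alignments enter the parameter estimation, so the following is only a minimal sketch of one plausible reading, under stated assumptions: it uses IBM Model 1 alone, omits the NULL word, and lets each hand-aligned link contribute a fixed count in place of the expected count that EM would otherwise compute. The function name `train_model1`, the `weight` parameter, and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def train_model1(sentence_pairs, aligned_pairs=(), weight=1.0, iterations=10):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM.

    sentence_pairs: iterable of (source_tokens, target_tokens).
    aligned_pairs:  iterable of (source_tokens, target_tokens, links), where
                    links is a set of (i, j) pairs linking source[i] to target[j].
    weight:         hypothetical scaling factor for counts from aligned_pairs.
    """
    sentence_pairs = list(sentence_pairs)
    aligned_pairs = list(aligned_pairs)
    t = None  # None stands for a uniform table during the first E-step

    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)

        # E-step on sentence-aligned data: fractional (expected) counts.
        for src, tgt in sentence_pairs:
            for f in tgt:
                scores = [1.0 if t is None else t.get((f, e), 0.0) for e in src]
                z = sum(scores)
                if z == 0.0:
                    continue  # no source word can currently explain f
                for e, s in zip(src, scores):
                    counts[(f, e)] += s / z
                    totals[e] += s / z

        # Word-aligned data: the links are observed, so the counts are fixed.
        for src, tgt, links in aligned_pairs:
            for i, j in links:
                counts[(tgt[j], src[i])] += weight
                totals[src[i]] += weight

        # M-step: renormalise so the probabilities for each source word sum to 1.
        t = {(f, e): c / totals[e] for (f, e), c in counts.items()}

    return t


# Toy data (assumed for illustration only).
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
hand_aligned = [(["das", "haus"], ["the", "house"], {(0, 0), (1, 1)})]

baseline = train_model1(corpus)                 # sentence-aligned data only
mixed = train_model1(corpus[1:], hand_aligned)  # one pair replaced by its word alignment
```

In this sketch, raising `weight` makes the hand-aligned links count for more relative to the sentence-aligned pairs, which is one simple way to explore the ratio of word-aligned to sentence-aligned data that the abstract mentions.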