Paper: Exploiting Variant Corpora For Machine Translation

ACL ID N06-2029
Title Exploiting Variant Corpora For Machine Translation
Venue Human Language Technologies
Session Short Paper
Year 2006
Authors
  • Michael Paul (National Institute of Information and Communications Technology, Kyoto Japan; ATR Spoken Language Communication Research Laboratories, Kyoto Japan)
  • Eiichiro Sumita

This paper proposes the usage of variant corpora, i.e., parallel text corpora that are equal in meaning but use different ways to express content, in order to improve corpus-based machine translation. The usage of multiple training corpora of the same content with different sources results in variant models that focus on specific linguistic phenomena covered by the respective corpus. The proposed method applies each variant model separately, resulting in multiple translation hypotheses which are selectively combined according to statistical models. The proposed method outperforms the conventional approach of merging all variants by reducing translation ambiguities and exploiting the strengths of each variant model.
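
The selection step described above can be illustrated with a minimal sketch. The abstract does not specify which statistical models drive the combination, so everything here is an assumption made for illustration: the function name `select_hypothesis`, the representation of a variant model as a plain callable, and the toy length-normalized unigram scorer are all hypothetical and are not the authors' implementation. The sketch only shows the general shape: translate the input with each variant model separately, score every hypothesis, and keep the best one.

```python
import math
from typing import Callable, Dict, Tuple

# Hypothetical types: a variant model maps a source sentence to one hypothesis,
# and a scorer maps (source, hypothesis) to a score where higher is better.
Translator = Callable[[str], str]
Scorer = Callable[[str, str], float]


def select_hypothesis(
    source: str,
    variant_models: Dict[str, Translator],
    score: Scorer,
) -> Tuple[str, str]:
    """Translate with each variant model separately and return the
    (model name, hypothesis) pair with the highest score."""
    if not variant_models:
        raise ValueError("at least one variant model is required")
    # One hypothesis per variant model, produced independently.
    hypotheses = {name: model(source) for name, model in variant_models.items()}
    # Selective combination, reduced here to picking the top-scoring hypothesis.
    best = max(hypotheses, key=lambda name: score(source, hypotheses[name]))
    return best, hypotheses[best]


# Toy stand-in for a statistical model (assumption): a unigram distribution
# over target words, used as a length-normalized log-probability.
TOY_UNIGRAM = {"where": 0.2, "is": 0.3, "the": 0.3, "station": 0.2}


def toy_score(source: str, hypothesis: str) -> float:
    words = hypothesis.lower().split()
    if not words:
        return float("-inf")
    # Unknown words get a small floor probability instead of zero.
    return sum(math.log(TOY_UNIGRAM.get(w, 1e-6)) for w in words) / len(words)


# Toy variant models (assumption): fixed outputs standing in for systems
# trained on different variants of the same content.
toy_models: Dict[str, Translator] = {
    "variant_a": lambda src: "where is the station",
    "variant_b": lambda src: "station where exist",
}

if __name__ == "__main__":
    print(select_hypothesis("eki wa doko desu ka", toy_models, toy_score))
```

In a real system the scorer would be replaced by whatever statistical models are used for selection, and the variant models by translation engines trained on each variant corpus. The merged-corpus baseline mentioned in the abstract would correspond to a single model trained on all variants, with no selection step.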