Paper: A Semi-Supervised Batch-Mode Active Learning Strategy for Improved Statistical Machine Translation

ACL ID W10-2916
Title A Semi-Supervised Batch-Mode Active Learning Strategy for Improved Statistical Machine Translation
Venue International Conference on Computational Natural Language Learning
Session Main Conference
Year 2010
Authors

The availability of substantial, in-domain parallel corpora is critical for the development of high-performance statistical machine translation (SMT) systems. Such corpora, however, are expensive to produce due to the labor-intensive nature of manual translation. We propose to alleviate this problem with a novel, semi-supervised, batch-mode active learning strategy that attempts to maximize in-domain coverage by selecting sentences that represent a balance between domain match, translation difficulty, and batch diversity. Simulation experiments on an English-to-Pashto translation task show that the proposed strategy outperforms not only the random selection baseline, but also traditional active learning techniques based on dissimilarity to existing training data. Our approa...