Paper: Submodularity for Data Selection in Machine Translation

ACL ID D14-1014
Title Submodularity for Data Selection in Machine Translation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014

We introduce submodular optimization to the problem of training data subset selection for statistical machine translation (SMT). By explicitly formulating data selection as a submodular program, we obtain fast scalable selection algorithms with mathematical performance guarantees, resulting in a unified framework that clarifies existing approaches and also makes both new and many previous approaches easily accessible. We present a new class of submodular functions designed specifically for SMT and evaluate them on two different translation tasks. Our results show that our best submodular method significantly outperforms several baseline methods, including the widely-used cross-entropy based data selection method. In addition, our approach easily scales to large data sets and is appli...
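
To make the idea of data selection as submodular maximization concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's method: the abstract does not define its functions, so the objective here (sum of square roots of n-gram counts over the selected sentences), the cardinality constraint `k`, and the names `ngrams` and `greedy_select` are all hypothetical. The objective is monotone submodular because it sums a concave, non-decreasing function of non-negative counts. For such functions under a cardinality constraint, the standard greedy algorithm is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams (as tuples) of length 1..n_max from a token list."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def greedy_select(sentences, k, n_max=2):
    """Greedily pick up to k sentence indices.

    Assumed objective (illustrative only): f(S) = sum over n-grams u of
    sqrt(count of u in S). Each step adds the sentence with the largest
    marginal gain in f.
    """
    feats = [Counter(ngrams(s.split(), n_max)) for s in sentences]
    counts = Counter()          # n-gram counts of the selected set so far
    selected = []
    remaining = list(range(len(sentences)))

    def gain(i):
        # Marginal gain of adding sentence i to the current selection.
        # Reading a missing key from a Counter returns 0 without inserting it.
        return sum(
            math.sqrt(counts[f] + c) - math.sqrt(counts[f])
            for f, c in feats[i].items()
        )

    while remaining and len(selected) < k:
        best = max(remaining, key=gain)  # ties go to the earliest index
        selected.append(best)
        counts.update(feats[best])
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    corpus = ["a b", "a b", "c d"]
    # The duplicate of "a b" has a reduced marginal gain once the first copy
    # is selected, so the second pick is the sentence with new n-grams.
    print(greedy_select(corpus, k=2))
```

The square root gives diminishing returns: an n-gram that is already well covered adds little, so redundant sentences are deprioritized in favor of sentences that contribute unseen n-grams. This sketch recomputes every gain at each step for clarity; it makes no attempt at the scalability the abstract describes.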