Paper: Domain Adaptation via Pseudo In-Domain Data Selection

ACL ID D11-1033
Title Domain Adaptation via Pseudo In-Domain Data Selection
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011
Authors

We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy-based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better,...
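
To make the idea of cross-entropy-based selection concrete, here is a minimal sketch. It is not the paper's implementation and is not claimed to match any of its three methods. It assumes a unigram language model with add-one smoothing trained on a small in-domain sample, whitespace tokenization, and a single (monolingual) side of the corpus. The function names `train_unigram`, `cross_entropy`, and `select_pseudo_in_domain` are hypothetical, as is the toy data.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens in the in-domain sample."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, counts, total, vocab_size):
    """Per-word cross-entropy in bits under an add-one-smoothed unigram model.

    Assumption: a unigram model is a stand-in for illustration only.
    """
    tokens = sentence.split()
    if not tokens:
        return float("inf")  # empty sentences are ranked last
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def select_pseudo_in_domain(in_domain, general, fraction):
    """Keep the `fraction` of general-domain sentences with the lowest
    cross-entropy under the in-domain model (lower = more in-domain-like)."""
    counts, total = train_unigram(in_domain)
    # Vocabulary covers both corpora so unseen words get non-zero probability.
    vocab = set(counts) | {tok for s in general for tok in s.split()}
    ranked = sorted(
        general, key=lambda s: cross_entropy(s, counts, total, len(vocab))
    )
    keep = max(1, int(len(general) * fraction))
    return ranked[:keep]


if __name__ == "__main__":
    # Toy data, invented for illustration.
    in_domain = ["the patient received the drug", "the drug dose was reduced"]
    general = [
        "the patient was given a drug dose",
        "stock markets fell sharply today",
        "the dose was reduced",
        "football fans cheered loudly",
    ]
    for sentence in select_pseudo_in_domain(in_domain, general, fraction=0.5):
        print(sentence)
```

In this sketch the selected subset plays the role of a pseudo in-domain subcorpus: it is drawn from the general corpus, not from the in-domain data itself. A realistic setup would likely use a stronger language model and, for translation, score parallel sentence pairs, not just one side.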