Paper: Latent Domain Translation Models in Mix-of-Domains Haystack

ACL ID C14-1182
Title Latent Domain Translation Models in Mix-of-Domains Haystack
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014
Authors

This paper addresses the problem of selecting adequate training sentence pairs from a mix-of-domains parallel corpus for a translation task represented by a small in-domain parallel corpus. We propose a novel latent domain translation model which includes domain priors, domain-dependent translation models and language models. The goal of learning is to estimate the probability of a sentence pair in mix-domain corpus to be in- or out-domain using in-domain corpus statistics as prior. We derive an EM training algorithm and provide solutions for estimating out-domain models (given only in- and mix-domain data). We report on experiments in data selection (intrinsic) and machine translation (extrinsic) on a large parallel corpus consisting of a mix of a rather diverse set of domains. Our re...
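
The latent-domain idea in the abstract can be illustrated with a toy sketch. It is not the paper's model: it assumes each sentence pair already has a log-likelihood under a fixed in-domain model and a fixed out-domain model, and it uses EM only to re-estimate the domain prior and the per-pair posterior of being in-domain. The function name `em_domain_prior` and all input numbers are hypothetical.

```python
import math


def em_domain_prior(loglik_in, loglik_out, prior_in=0.5, iterations=20):
    """Toy EM for a two-domain mixture with FIXED component models.

    loglik_in[i]  : log P(pair_i | in-domain)   (assumed given)
    loglik_out[i] : log P(pair_i | out-domain)  (assumed given)
    Returns the estimated in-domain prior and, for each pair, the
    posterior computed in the last E-step.
    """
    posteriors = []
    for _ in range(iterations):
        # E-step: posterior that each pair is in-domain under the current prior.
        posteriors = []
        for l_in, l_out in zip(loglik_in, loglik_out):
            a = math.log(prior_in) + l_in
            b = math.log(1.0 - prior_in) + l_out
            m = max(a, b)  # subtract the max for numerical stability
            p_in = math.exp(a - m)
            p_out = math.exp(b - m)
            posteriors.append(p_in / (p_in + p_out))
        # M-step: the new prior is the mean posterior; clamp to keep log() finite.
        prior_in = sum(posteriors) / len(posteriors)
        prior_in = min(max(prior_in, 1e-9), 1.0 - 1e-9)
    return prior_in, posteriors


# Hypothetical scores: the first two pairs look in-domain, the last two do not.
loglik_in = [-10.0, -12.0, -30.0, -28.0]
loglik_out = [-20.0, -25.0, -15.0, -14.0]
prior, post = em_domain_prior(loglik_in, loglik_out)
print(prior, post)
```

Pairs could then be ranked by posterior to select training data. In the model the abstract describes, the domain-dependent translation and language models are themselves estimated during training; this sketch holds them fixed to keep the example short.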