Paper: Structured and Unstructured Cache Models for SMT Domain Adaptation

ACL ID E14-1017
Title Structured and Unstructured Cache Models for SMT Domain Adaptation
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Main Conference
Year 2014
Authors

We present a French to English translation system for Wikipedia biography articles. We use training data from out-of-domain corpora and adapt the system for biographies. We propose two forms of domain adaptation. The first biases the system towards words likely in biographies and encourages repetition of words across the document. Since biographies in Wikipedia follow a regular structure, our second model exploits this structure as a sequence of topic segments, where each segment discusses a narrower subtopic of the biography domain. In this structured model, the system is encouraged to use words likely in the current segment's topic rather than in biographies as a whole. We implement both systems using cache-based translation techniques. We show that a system trained on Europarl a...
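
The difference between the two adaptation models described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the paper integrates cache features into an SMT decoder, whereas this example only scores isolated candidate words. The class names, the one-point bonuses, the segment labels (`early_life`, `career`), and the word lists are all assumptions made for illustration.

```python
from collections import Counter


class UnstructuredCache:
    """Hypothetical cache: domain-typical words plus words already used in the document."""

    def __init__(self, domain_words):
        self.domain_words = set(domain_words)
        self.used = Counter()  # words emitted earlier in the current document

    def score(self, word):
        # One point for a domain-typical word, one more if it repeats an earlier choice.
        return int(word in self.domain_words) + int(self.used[word] > 0)

    def update(self, translated_words):
        # Record the words of a finished sentence so later sentences can repeat them.
        self.used.update(translated_words)


class StructuredCache(UnstructuredCache):
    """Hypothetical variant: the domain word list depends on the current topic segment."""

    def __init__(self, segment_words):
        super().__init__(())
        self.segment_words = {topic: set(ws) for topic, ws in segment_words.items()}

    def set_topic(self, topic):
        # Swap in the word list of the segment being translated (empty if unknown).
        self.domain_words = self.segment_words.get(topic, set())


def choose(candidates, cache):
    # Pick the candidate with the highest cache score; ties keep the first candidate.
    return max(candidates, key=cache.score)
```

In this sketch, `UnstructuredCache` rewards the same biography-wide vocabulary everywhere in the document, while `StructuredCache` changes the rewarded vocabulary each time `set_topic` is called for a new segment. A real system would combine such scores with the translation and language model scores rather than use them alone.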