Paper: How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT

ACL ID E14-1061
Title How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Main Conference
Year 2014
Authors

Compounding in morphologically rich languages is a highly productive process which often causes SMT approaches to fail because of unseen words. We present an approach for translation into a compounding language that splits compounds into simple words for training and, due to an underspecified representation, allows for free merging of simple words into compounds after translation. In contrast to previous approaches, we use features projected from the source language to predict compound mergings. We integrate our approach into end-to-end SMT and show that many compounds matching the reference translation are produced which did not appear in the training data. Additional manual evaluations support the usefulness of generalizing compound formation in SMT.
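
The split-then-merge idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the split lexicon `SPLITS`, the function names, and the `should_merge` predicate are all hypothetical stand-ins. In particular, the abstract states that merging decisions are predicted from features projected from the source language, whereas the predicate below is a hard-coded toy rule, and the sketch ignores linking elements and inflection, which the underspecified representation mentioned in the abstract is presumably meant to handle.

```python
from typing import Callable, List

# Hypothetical toy split lexicon; a real system would rely on a
# morphological analyser rather than a fixed dictionary.
SPLITS = {
    "Teddybär": ["Teddy", "Bär"],
    "Obstkiste": ["Obst", "Kiste"],
}


def split_compounds(tokens: List[str]) -> List[str]:
    """Replace each known compound by its parts (applied to training data)."""
    out: List[str] = []
    for tok in tokens:
        out.extend(SPLITS.get(tok, [tok]))
    return out


def merge_compounds(
    tokens: List[str], should_merge: Callable[[str, str], bool]
) -> List[str]:
    """Merge adjacent simple words after translation.

    `should_merge(left, right)` is an assumed stand-in for a learned
    merging decision; it sees the two original adjacent tokens.
    """
    merged: List[str] = []
    for i, tok in enumerate(tokens):
        if merged and should_merge(tokens[i - 1], tok):
            # Naive concatenation: lowercase the right-hand part and
            # attach it without any linking element.
            merged[-1] = merged[-1] + tok.lower()
        else:
            merged.append(tok)
    return merged


# Toy decision rule (assumption): merge these specific word pairs only.
MERGE_PAIRS = {("Teddy", "Bär"), ("Teddy", "Kiste")}


def toy_should_merge(left: str, right: str) -> bool:
    return (left, right) in MERGE_PAIRS
```

Because merging operates on simple words rather than on whole compounds, the sketch can emit a compound such as "Teddykiste" even though only "Teddybär" and "Obstkiste" appear in the toy lexicon, which mirrors the abstract's point about producing compounds unseen in the training data.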