Paper: Automatic Discovery Of Non-Compositional Compounds In Parallel Data

ACL ID W97-0311
Title Automatic Discovery Of Non-Compositional Compounds In Parallel Data
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 1997
Authors

Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the qual...