Paper: Automatic Parallel Fragment Extraction from Noisy Data

ACL ID N12-1061
Title Automatic Parallel Fragment Extraction from Noisy Data
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2012
Authors

We present a novel method to detect parallel fragments within noisy parallel corpora. Isolating these parallel fragments from the noisy data in which they are contained frees us from noisy alignments and stray links that can severely constrain translation-rule extraction. We do this with existing machinery, making use of an existing word alignment model for this task. We evaluate the quality and utility of the extracted data on large-scale Chinese-English and Arabic-English translation tasks and show significant improvements over a state-of-the-art baseline.