Paper: An Automatic Filter For Non-Parallel Texts

ACL ID P04-3006
Title An Automatic Filter For Non-Parallel Texts
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2004

Numerous cross-lingual applications, including state-of-the-art machine translation systems, require parallel texts aligned at the sentence level. However, collections of such texts are often polluted by pairs of texts that are comparable but not parallel. Bitext maps can help to discriminate between parallel and comparable texts. Bitext mapping algorithms use a larger set of document features than competing approaches to this task, resulting in higher accuracy. In addition, good bitext mapping algorithms are not limited to documents with structural mark-up such as web pages. The task of filtering non-parallel text pairs represents a new application of bitext mapping algorithms.
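
To make the idea of using a bitext map as a filter more concrete, the following is a minimal Python sketch. It is not the system described in the paper. It assumes a toy notion of correspondence (identical tokens such as numbers and proper names) and relies on the general observation that correspondence points of a parallel pair tend to lie near the main diagonal of the bitext space. The function names `diagonal_match_score` and `looks_parallel`, the `window` and `threshold` parameters, and their default values are all hypothetical choices made for illustration.

```python
def diagonal_match_score(src_tokens, tgt_tokens, window=0.1):
    """Toy score: fraction of source tokens that have an identical target
    token at a similar relative position (i.e. near the main diagonal).

    Assumption: identical strings stand in for real correspondence points;
    an actual bitext mapping algorithm uses richer matching evidence.
    """
    if not src_tokens or not tgt_tokens:
        return 0.0

    # Index every target token by its relative positions in [0, 1).
    positions = {}
    for j, tok in enumerate(tgt_tokens):
        positions.setdefault(tok, []).append(j / len(tgt_tokens))

    hits = 0
    for i, tok in enumerate(src_tokens):
        rel = i / len(src_tokens)
        # Count the token if any identical target token is within the window.
        if any(abs(rel - p) <= window for p in positions.get(tok, [])):
            hits += 1
    return hits / len(src_tokens)


def looks_parallel(src_tokens, tgt_tokens, threshold=0.3, window=0.1):
    """Hypothetical filter: accept the pair if enough points hug the diagonal."""
    return diagonal_match_score(src_tokens, tgt_tokens, window) >= threshold


if __name__ == "__main__":
    src = "the meeting on 12 May 2004 in Paris cost 300 euros".split()
    parallel = "la réunion du 12 mai 2004 à Paris a coûté 300 euros".split()
    comparable = "300 euros a été le coût de la réunion à Paris en 2004".split()
    print(looks_parallel(src, parallel))
    print(looks_parallel(src, comparable))
```

In this sketch the comparable pair shares the same anchor tokens as the parallel pair, but they occur in a different order, so few of them fall near the diagonal. The threshold would need tuning on real data.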