Paper: Exploiting Comparable Corpora with TER and TERp

ACL ID W09-3109
Title Exploiting Comparable Corpora with TER and TERp
Venue Building and Using Comparable Corpora
Year 2009

In this paper we present an extension of a successful simple and effective method for extracting parallel sentences from com- parable corpora and we apply it to an Arabic/English NIST system. We exper- iment with a new TERp filter, along with WER and TER filters. We also report a comparison of our approach with that of (Munteanu and Marcu, 2005) using ex- actly the same corpora and show perfor- mance gain by using much lesser data. Our approach employs an SMT system built from small amounts of parallel texts to translate the source side of the non- parallel corpus. The target side texts are used, along with other corpora, in the lan- guage model of this SMT system. We then use information retrieval techniques and simple filters to create parallel data from a comparable news corpora. We eva...