Paper: Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text

ACL ID W11-1218
Title Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text
Venue Building and Using Comparable Corpora
Session
Year 2011
Authors

We describe a set of techniques developed while collecting parallel texts for the Russian-English language pair and building a corpus of parallel sentences for training a statistical machine translation system. We discuss the issues of verifying potential parallel texts and filtering out automatically translated documents. Finally, we evaluate the quality of the 1-million-sentence corpus, which we believe may be a useful resource for machine translation research.