Paper: Dirt Cheap Web-Scale Parallel Text from the Common Crawl

ACL ID P13-1135
Title Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2013

Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon's Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boo...