Paper: Microblogs as Parallel Corpora

ACL ID P13-1018
Title Microblogs as Parallel Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2013

In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others "retweet" translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only its public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at h...