Paper: Mining Large-scale Parallel Corpora from Multilingual Patents: An English-Chinese example and its application to SMT

ACL ID W10-4110
Title Mining Large-scale Parallel Corpora from Multilingual Patents: An English-Chinese example and its application to SMT
Venue Joint Conference on Chinese Language Processing
Session Main Conference
Year 2010
Authors

In this paper, we demonstrate how to mine large-scale parallel corpora from multilingual patents, a resource that has not been thoroughly explored before. We show how a large-scale English-Chinese parallel corpus containing over 14 million sentence pairs, of which only 1-5% are wrong, can be mined from a large number of English-Chinese bilingual patents. To our knowledge, this is the largest single parallel corpus in terms of sentence pairs. Moreover, we estimate the potential for mining multilingual parallel corpora involving English, Chinese, Japanese, Korean, German, etc., which would to some extent ease the parallel data acquisition bottleneck in multilingual information processing.