Paper: Chinese–Japanese Parallel Sentence Extraction from Quasi–Comparable Corpora

ACL ID W13-2505
Title Chinese–Japanese Parallel Sentence Extraction from Quasi–Comparable Corpora
Venue Building and Using Comparable Corpora
Session
Year 2013
Authors

Parallel sentences are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese–Japanese. Many studies have been conducted on extracting parallel sentences from noisy parallel or comparable corpora. We extract Chinese–Japanese parallel sentences from quasi–comparable corpora, which are available in far larger quantities. The task is significantly more difficult than extraction from noisy parallel or comparable corpora. We extend a previous study that treats parallel sentence identification as a binary classification problem. The previous method of classifier training by the Cartesian product is not practical, because it differs from the real process of parallel sentence extraction. We propose a novel classifier tra...