Paper: Multi-Level Bootstrapping For Extracting Parallel Sentences From A Quasi-Comparable Corpus

ACL ID: C04-1151
Title: Multi-Level Bootstrapping For Extracting Parallel Sentences From A Quasi-Comparable Corpus
Venue: International Conference on Computational Linguistics
Session: Main Conference
Year: 2004
Authors:

We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts that differ greatly in size and include both in-topic and off-topic documents. We discuss and analyze different bilingual corpora with various levels of comparability. We argue that while better document matching leads to better parallel sentence extraction, better sentence matching in turn leads to better document matching. Based on this observation, we use multi-level bootstrapping to iteratively improve the alignments between documents, sentences, and bilingual word pairs. Our method is the first that relies on neither supervised training data, such as a sentence-aligned corpus, nor temporal information, such as the publishing date of a news article. It is validated by...
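The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring functions, so the sketch assumes Jaccard word overlap for both document and sentence similarity and simple co-occurrence counting for lexicon learning. The names `bootstrap`, `translate`, `jaccard`, the thresholds, and the toy data are all hypothetical.

```python
from collections import Counter


def translate(tokens, lexicon):
    """Map source tokens to the set of target words the lexicon proposes."""
    out = set()
    for tok in tokens:
        out |= lexicon.get(tok, set())
    return out


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for a real similarity score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap(src_docs, tgt_docs, seed_lexicon,
              doc_thr=0.2, sent_thr=0.6, min_count=2, max_iters=5):
    """Alternate document matching, sentence extraction, and lexicon growth."""
    lexicon = {s: set(t) for s, t in seed_lexicon.items()}  # copy the seed
    pairs = set()
    for _ in range(max_iters):
        # Level 1: match documents whose translated vocabulary overlaps.
        doc_pairs = [
            (i, j)
            for i, sd in enumerate(src_docs)
            for j, td in enumerate(tgt_docs)
            if jaccard(translate({w for s in sd for w in s}, lexicon),
                       {w for s in td for w in s}) >= doc_thr
        ]
        # Level 2: extract sentence pairs from matched documents only.
        new_pairs = set()
        for i, j in doc_pairs:
            for a, s in enumerate(src_docs[i]):
                for b, t in enumerate(tgt_docs[j]):
                    if jaccard(translate(s, lexicon), set(t)) >= sent_thr:
                        new_pairs.add(((i, a), (j, b)))
        # Level 3: propose word pairs from words the lexicon cannot explain.
        counts = Counter()
        for (i, a), (j, b) in new_pairs:
            src, tgt = src_docs[i][a], tgt_docs[j][b]
            unexplained = set(tgt) - translate(src, lexicon)
            for w in set(src):
                if w not in lexicon:
                    for v in unexplained:
                        counts[(w, v)] += 1
        grew = False
        for (w, v), c in counts.items():
            if c >= min_count:
                lexicon.setdefault(w, set()).add(v)
                grew = True
        # Stop once neither the sentence pairs nor the lexicon change.
        if new_pairs == pairs and not grew:
            break
        pairs = new_pairs
    return pairs, lexicon


# Toy data (hypothetical): one in-topic and one off-topic target document.
seed = {"cat": {"CHAT"}, "dog": {"CHIEN"}, "eats": {"MANGE"}}
src_docs = [[["cat", "eats", "fish"], ["dog", "eats", "fish"]]]
tgt_docs = [
    [["CHAT", "MANGE", "POISSON"], ["CHIEN", "MANGE", "POISSON"]],
    [["BOURSE", "HAUSSE"]],
]
pairs, lexicon = bootstrap(src_docs, tgt_docs, seed)
```

In this toy run the seed lexicon has no entry for "fish". The first pass extracts two sentence pairs from the in-topic document pair, and the word pair learned from them feeds back into the document and sentence scores on the next pass. A real system would replace the overlap scores and the counting step with proper similarity measures and a statistical alignment model.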