Paper: A Simple Sentence-Level Extraction Algorithm for Comparable Data

ACL ID N09-2024
Title A Simple Sentence-Level Extraction Algorithm for Comparable Data
Venue Human Language Technologies
Session Short Paper
Year 2009
Authors

The paper presents a novel sentence pair ex- traction algorithm for comparable data, where a large set of candidate sentence pairs is scored directly at the sentence-level. The sentence- level extraction relies on a very efficient im- plementation of a simple symmetric scoring function: a computation speed-up by a fac- tor of 30 is reported. On Spanish-English data, the extraction algorithm finds the highest scoring sentence pairs from close to 1 trillion candidate pairs without search errors. Sig- nificant improvements in BLEU are reported by including the extracted sentence pairs into the training of a phrase-based SMT (Statistical Machine Translation) system.