Paper: Extracting Parallel Phrases from Comparable Data

ACL ID W11-1209
Title Extracting Parallel Phrases from Comparable Data
Venue Building and Using Comparable Corpora
Year 2011

Mining parallel data from comparable corpora is a promising approach for overcoming data sparseness in statistical machine translation and other NLP applications. Even if two comparable documents have few or no parallel sentence pairs, there is still potential for parallelism at the sub-sentential level. The ability to detect these phrases creates a valuable resource, especially for low-resource languages. In this paper we explore three phrase alignment approaches to detect parallel phrase pairs embedded in comparable sentences: the standard phrase extraction algorithm, which relies on the Viterbi path; a phrase extraction approach that does not rely on the Viterbi path, but uses only lexical features; and a binary classifier that detects parallel phrase pairs when presen...
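
For orientation, the first of the three approaches named in the abstract, standard phrase extraction, is usually described as collecting every phrase pair that is consistent with a word alignment (typically the Viterbi alignment). The sketch below illustrates that consistency heuristic only; it is not the paper's implementation. The function name `extract_phrase_pairs`, the `max_len` parameter, and the toy sentences are assumptions made for illustration, and the usual extension over unaligned boundary words is omitted.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    src, tgt  : lists of tokens
    alignment : set of (source_index, target_index) links
    max_len   : hypothetical cap on phrase length, in tokens, on each side
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any source word in [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no target word inside [j1, j2] may be
            # linked to a source word outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example (invented data): a fully monotone one-to-one alignment.
src = ["das", "haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}
pairs = extract_phrase_pairs(src, tgt, alignment)
```

Because this heuristic depends entirely on the alignment links, a poor alignment over a mostly non-parallel sentence pair yields few or misleading phrase pairs, which is the motivation the abstract gives for also considering approaches that do not rely on the Viterbi path.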