Paper: Unsupervised Construction Of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources

ACL ID C04-1051
Title Unsupervised Construction Of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2004
Authors

We investigate unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources. Two techniques are employed: (1) simple string edit distance, and (2) a heuristic strategy that pairs initial (presumably summary) sentences from different news stories in the same cluster. We evaluate both datasets using a word alignment algorithm and a metric borrowed from machine translation. Results show that the edit distance data is cleaner and more easily aligned than the heuristic data, with an overall alignment error rate (AER) of 11.58% on a similarly extracted test set. On test data extracted by the heuristic strategy, however, performance of the two training sets is similar, ...
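
The two extraction techniques and the evaluation metric named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not say whether edit distance is computed over words or characters (words are used here), the `max_distance` threshold is a hypothetical parameter, and each article is assumed to be already split into a list of sentences. The AER function uses the standard definition from the word alignment literature, 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the sure gold links, and P the possible gold links (with S contained in P).

```python
from itertools import combinations


def word_edit_distance(a, b):
    """Levenshtein distance over whitespace tokens (tokenization is an assumption)."""
    x, y = a.split(), b.split()
    prev = list(range(len(y) + 1))  # distance from the empty prefix of x
    for i, wx in enumerate(x, start=1):
        curr = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_distance_pairs(sentences, max_distance):
    """Technique (1): keep non-identical sentence pairs within a distance threshold.

    max_distance is a hypothetical parameter; the abstract gives no value.
    """
    pairs = []
    for s1, s2 in combinations(sentences, 2):
        d = word_edit_distance(s1, s2)
        if 0 < d <= max_distance:
            pairs.append((s1, s2, d))
    return pairs


def first_sentence_pairs(cluster):
    """Technique (2): pair the initial sentences of different articles in one cluster.

    cluster is a list of articles, each a non-empty list of sentences (assumed format).
    """
    leads = [article[0] for article in cluster if article]
    return list(combinations(leads, 2))


def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S treated as a subset of P."""
    possible = possible | sure
    if not predicted and not sure:
        return 0.0
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))
```

Alignments here are sets of `(source_index, target_index)` links. A lower AER means the predicted links agree more closely with the gold annotation, which is the sense in which the abstract reports 11.58% for the edit distance data.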