Paper: Collecting Highly Parallel Data for Paraphrase Evaluation

ACL ID P11-1020
Title Collecting Highly Parallel Data for Paraphrase Evaluation
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011
Authors

A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
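
The abstract's idea of using n-gram comparisons for both adequacy and dissimilarity can be illustrated with a minimal sketch. This is not the paper's exact metric definitions: the function names (`ngram_overlap`, `adequacy`, `dissimilarity`), the whitespace tokenization, the choice of n = 1..4, and the use of a maximum over references are all assumptions made for illustration. The intent is only to show that overlap with parallel references can stand in for semantic adequacy, while lack of overlap with the source sentence can stand in for lexical dissimilarity.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, other: str, max_n: int = 4) -> float:
    """Average, over n = 1..max_n, of the fraction of the candidate's
    n-grams that also occur in `other`. Orders of n for which the
    candidate has no n-grams are skipped."""
    cand = candidate.lower().split()  # assumption: whitespace tokenization
    oth = other.lower().split()
    fractions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        if not cand_ngrams:
            continue  # candidate is shorter than n tokens
        shared = cand_ngrams & ngrams(oth, n)
        fractions.append(len(shared) / len(cand_ngrams))
    return sum(fractions) / len(fractions) if fractions else 0.0


def adequacy(candidate: str, references: List[str], max_n: int = 4) -> float:
    """Hypothetical adequacy proxy: best n-gram overlap with any of the
    parallel reference descriptions."""
    return max(ngram_overlap(candidate, ref, max_n) for ref in references)


def dissimilarity(candidate: str, source: str, max_n: int = 4) -> float:
    """Hypothetical dissimilarity proxy: share of candidate n-grams that
    do NOT appear in the source sentence."""
    return 1.0 - ngram_overlap(candidate, source, max_n)


if __name__ == "__main__":
    source = "a man is slicing a tomato"
    references = ["someone is cutting a tomato", "a person slices a tomato"]
    candidate = "a person is cutting a tomato"
    print("adequacy:", adequacy(candidate, references))
    print("dissimilarity:", dissimilarity(candidate, source))
```

Under these assumptions, a good paraphrase candidate would score high on both functions: it shares n-grams with other descriptions of the same content while reusing few n-grams from the sentence it paraphrases.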