Paper: PPDB: The Paraphrase Database

ACL ID N13-1092
Title PPDB: The Paraphrase Database
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2013
Authors

We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our r...