Paper: PPDB: The Paraphrase Database

ACL ID N13-1092
Title PPDB: The Paraphrase Database
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2013

We present the 1.0 release of our paraphrase database, PPDB. Its English portion, PPDB:Eng, contains over 220 million paraphrase pairs, consisting of 73 million phrasal and 8 million lexical paraphrases, as well as 140 million paraphrase patterns, which capture many meaning-preserving syntactic transformations. The paraphrases are extracted from bilingual parallel corpora totaling over 100 million sentence pairs and over 2 billion English words. We also release PPDB:Spa, a collection of 196 million Spanish paraphrases. Each paraphrase pair in PPDB contains a set of associated scores, including paraphrase probabilities derived from the bitext data and a variety of monolingual distributional similarity scores computed from the Google n-grams and the Annotated Gigaword corpus. Our r...