Paper: Mining Wikipedia Revision Histories for Improving Sentence Compression

ACL ID P08-2035
Title Mining Wikipedia Revision Histories for Improving Sentence Compression
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2008
Authors

A well-recognized limitation of research on supervised sentence compression is the dearth of available training data. We propose a new and bountiful resource for such training data, which we obtain by mining the revision history of Wikipedia for sentence compressions and expansions. Using only a fraction of the available Wikipedia data, we have collected a training corpus of over 380,000 sentence pairs, two orders of magnitude larger than the standardly used Ziff-Davis corpus. Using this newfound data, we propose a novel lexicalized noisy channel model for sentence compression, achieving improved results in grammaticality and compression rate criteria with a slight decrease in importance.
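
To make the mining idea concrete, the following is a minimal sketch of how compression and expansion pairs might be detected between two adjacent revisions of an article. It is not the authors' pipeline: it assumes that a compression is obtained purely by deleting words while preserving order, that sentences are already split, and that whitespace tokenization is adequate. The function names `is_deletion_compression` and `compression_pairs` are hypothetical.

```python
def is_deletion_compression(long_sentence, short_sentence):
    """Return True if short_sentence can be obtained from long_sentence
    by deleting at least one word while keeping word order.

    Assumption: whitespace tokenization and exact token matching.
    """
    long_tokens = long_sentence.split()
    short_tokens = short_sentence.split()
    if not short_tokens or len(short_tokens) >= len(long_tokens):
        return False
    remaining = iter(long_tokens)
    # Each short token must be found, in order, among the long tokens
    # that have not yet been consumed (ordered-subsequence test).
    return all(token in remaining for token in short_tokens)


def compression_pairs(old_sentences, new_sentences):
    """Collect (long, short) sentence pairs between two revisions.

    An edit that shortens a sentence is a compression; an edit that
    lengthens one is an expansion, stored in the same (long, short) order.
    """
    pairs = []
    for old in old_sentences:
        for new in new_sentences:
            if is_deletion_compression(old, new):
                pairs.append((old, new))  # compression: old -> new
            elif is_deletion_compression(new, old):
                pairs.append((new, old))  # expansion: new is the long form
    return pairs
```

Storing expansions in the same (long, short) order as compressions lets both kinds of edit serve as training pairs for a single compression model, which is one plausible reading of why the abstract mentions both.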