Paper: Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

ACL ID P10-1130
Title Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2010
Authors

We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of mag...
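To make the notion of "raw bracketings" concrete, here is a minimal sketch of how token spans might be collected from the four tags the abstract names. It is an illustration only, not the authors' conversion procedure: the class `BracketExtractor`, the function `extract_brackets`, whitespace tokenization, and the `(tag, start, end)` span format are all assumptions introduced for this example, and the refinement of raw spans into parsing constraints is not shown.

```python
from html.parser import HTMLParser

# The four tags named in the abstract: anchors, bold, italics, underlines.
MARKUP_TAGS = {"a", "b", "i", "u"}


class BracketExtractor(HTMLParser):
    """Hypothetical helper: collects whitespace tokens and raw tag spans."""

    def __init__(self):
        super().__init__()
        self.tokens = []   # whitespace-separated tokens of the visible text
        self.spans = []    # (tag, start, end) with end exclusive, in token indices
        self._open = []    # stack of (tag, token index where the tag opened)

    def handle_starttag(self, tag, attrs):
        if tag in MARKUP_TAGS:
            self._open.append((tag, len(self.tokens)))

    def handle_endtag(self, tag):
        # Close the most recently opened tag of the same name, if any.
        for k in range(len(self._open) - 1, -1, -1):
            if self._open[k][0] == tag:
                _, start = self._open.pop(k)
                end = len(self.tokens)
                if end > start:  # ignore tags that enclose no tokens
                    self.spans.append((tag, start, end))
                break

    def handle_data(self, data):
        # Assumption: naive whitespace tokenization stands in for a real tokenizer.
        self.tokens.extend(data.split())


def extract_brackets(html):
    """Return (tokens, spans) for one HTML fragment."""
    parser = BracketExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.tokens, parser.spans


if __name__ == "__main__":
    sentence = 'The <a href="x">Wall Street Journal</a> is <b>widely</b> read.'
    tokens, spans = extract_brackets(sentence)
    print(tokens)
    print(spans)
```

Each span marks a contiguous token range that the page author chose to set apart. Such ranges are only approximate phrase boundaries, which is why the abstract speaks of refining them before using them as constraints.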