Paper: Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

ACL ID P11-1135
Title Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011
Authors

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don’t do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in ...