Paper: Generalizing a Strongly Lexicalized Parser using Unlabeled Data

ACL ID E14-1014
Title Generalizing a Strongly Lexicalized Parser using Unlabeled Data
Venue Annual Meeting of The European Chapter of The Association of Computational Linguistics
Session Main Conference
Year 2014
Authors

Statistical parsers trained on labeled data suffer from sparsity, both grammatical and lexical. For parsers based on strongly lexicalized grammar formalisms (such as CCG, which has complex lexical categories but simple combinatory rules), the problem of sparsity can be isolated to the lexicon. In this paper, we show that semi-supervised Viterbi-EM can be used to extend the lexicon of a generative CCG parser. By learning complex lexical entries for low-frequency and unseen words from unlabeled data, we obtain improvements over our supervised model for both in-domain (WSJ) and out-of-domain (questions and Wikipedia) data. Our learnt lexicons when used with a discriminative parser such as C&C also significantly improve its performance on unseen words.
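To make the idea of semi-supervised Viterbi-EM concrete, the following is a minimal sketch and not the model used in the paper. It assumes a toy stand-in for the generative CCG parser: a bigram sequence model over lexical categories with add-alpha smoothing, where the "best parse" is simply the best category sequence. All function names, the smoothing constant, and the example sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def train_counts(tagged_sentences):
    """Collect transition and emission counts from (word, category) sequences."""
    trans, emit = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, cat in sent:
            trans[(prev, cat)] += 1
            emit[(cat, word)] += 1
            prev = cat
    return trans, emit

def log_prob(counts, context, event, n_events, alpha=0.1):
    """Add-alpha smoothed log P(event | context); alpha is an assumed constant."""
    total = sum(v for (c, _), v in counts.items() if c == context)
    return math.log((counts[(context, event)] + alpha) / (total + alpha * n_events))

def viterbi(words, cats, vocab_size, trans, emit):
    """Return the single most probable category sequence for `words`."""
    best = {}
    for c in cats:
        score = (log_prob(trans, "<s>", c, len(cats))
                 + log_prob(emit, c, words[0], vocab_size))
        best[c] = (score, [c])
    for w in words[1:]:
        new = {}
        for c in cats:
            # Pick the best predecessor category for c.
            score, path = max(
                ((best[p][0] + log_prob(trans, p, c, len(cats)), best[p][1])
                 for p in cats),
                key=lambda x: x[0])
            new[c] = (score + log_prob(emit, c, w, vocab_size), path + [c])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

def viterbi_em(labeled, unlabeled, iterations=3):
    """Hard EM: label unlabeled text with the best analysis, then re-estimate."""
    cats = sorted({c for s in labeled for _, c in s})
    vocab = {w for s in labeled for w, _ in s} | {w for s in unlabeled for w in s}
    sup_trans, sup_emit = train_counts(labeled)
    trans, emit = sup_trans, sup_emit
    for _ in range(iterations):
        # Hard E-step: one best category sequence per unlabeled sentence.
        auto = [list(zip(s, viterbi(s, cats, len(vocab), trans, emit)))
                for s in unlabeled]
        # M-step: counts from labeled data plus the automatically labeled data.
        auto_trans, auto_emit = train_counts(auto)
        trans, emit = sup_trans + auto_trans, sup_emit + auto_emit
    return trans, emit, cats, len(vocab)

# Toy data: "fox" never occurs in the labeled set.
labeled = [
    [("the", "NP/N"), ("dog", "N"), ("barks", "S\\NP")],
    [("the", "NP/N"), ("cat", "N"), ("sleeps", "S\\NP")],
]
unlabeled = [["the", "fox", "barks"], ["the", "fox", "sleeps"]]

trans, emit, cats, vocab_size = viterbi_em(labeled, unlabeled)
print(emit[("N", "fox")])
```

In this toy setting the unseen word "fox" acquires a lexical entry with category N purely from the unlabeled sentences, which mirrors, in a much simplified form, how a lexicon can be extended for low-frequency and unseen words. A real CCG parser would replace the sequence model with a chart parser over derivations.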