Paper: Supervised Grammar Induction Using Training Data With Limited Constituent Information

ACL ID P99-1010
Title Supervised Grammar Induction Using Training Data With Limited Constituent Information
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1999
Authors

Corpus-based grammar induction generally relies on hand-parsed training data to learn the structure of the language. Unfortunately, the cost of building large annotated corpora is prohibitively expensive. This work aims to improve the induction strategy when there are few labels in the training data. We show that the most informative linguistic constituents are the higher nodes in the parse trees, typically denoting complex noun phrases and sentential clauses. They account for only 20% of all constituents. For inducing grammars from sparsely labeled training data (e.g., only higher-level constituent labels), we propose an adaptation strategy, which produces grammars that parse almost as well as grammars induced from fully labeled corpora. Our results suggest that for a partial...