Paper: Compacting the Penn Treebank Grammar

ACL ID P98-1115
Title Compacting the Penn Treebank Grammar
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 1998

Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad-coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a complete grammar, if one exists at some limit. However, we offer an alternative explanation in terms of the underspecification of structures within the treebank. This hypothesis is explored by applying an algorithm to compact the derived grammar by eliminating redundant rules: rules whose right-hand sides can be parsed by other rules. The size of the resulting compacted grammar, which is ...
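The compaction idea in the abstract (drop a rule if its right-hand side can be parsed into its left-hand side by the remaining rules) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the rule representation, the function names `derives` and `compact`, the toy grammar, and the order in which rules are visited are all assumptions made for this sketch. It also assumes the grammar has no empty right-hand sides.

```python
from functools import lru_cache

# A rule is (lhs, rhs) with rhs a tuple of symbols, e.g. ("NP", ("DT", "NN")).
# Assumption of this sketch: no rule has an empty right-hand side.


def derives(rules, lhs, rhs):
    """Return True if the symbol sequence `rhs` can be parsed as `lhs` using `rules`."""
    rhs = tuple(rhs)
    rules = tuple(rules)

    @lru_cache(maxsize=None)
    def cover(i, j):
        """Set of symbols that can span rhs[i:j]."""
        found = {rhs[i]} if j - i == 1 else set()
        # Rules with two or more daughters only look at strictly smaller spans.
        for a, body in rules:
            if len(body) > 1 and fits(body, 0, i, j):
                found.add(a)
        # Close the set under unary rules until nothing new is added.
        changed = True
        while changed:
            changed = False
            for a, body in rules:
                if len(body) == 1 and body[0] in found and a not in found:
                    found.add(a)
                    changed = True
        return frozenset(found)

    def fits(body, k, i, j):
        """Can body[k:] be matched, symbol by symbol, against rhs[i:j]?"""
        remaining = len(body) - k
        if remaining == 1:
            return body[k] in cover(i, j)
        # Leave at least one input symbol for each remaining body symbol.
        return any(
            body[k] in cover(i, m) and fits(body, k + 1, m, j)
            for m in range(i + 1, j - remaining + 2)
        )

    return lhs in cover(0, len(rhs))


def compact(rules):
    """Remove rules whose right-hand side is parsable by the other kept rules."""
    kept = set(rules)
    # Visiting longer rules first is an arbitrary choice made for this sketch.
    for rule in sorted(rules, key=lambda r: (-len(r[1]), r)):
        others = kept - {rule}
        if derives(others, rule[0], rule[1]):
            kept = others  # the rule is redundant, so drop it
    return kept


# Invented toy grammar, not taken from the treebank.
toy_rules = {
    ("VP", ("VB", "NP", "PP")),
    ("VP", ("VB", "NP")),
    ("VP", ("VP", "PP")),
    ("NP", ("DT", "NN")),
    ("PP", ("IN", "NP")),
}
compacted = compact(toy_rules)
```

In this toy grammar the flat rule VP → VB NP PP is dropped, because VB NP PP can already be analysed as VP using VP → VB NP followed by VP → VP PP; the other four rules are kept because none of them can be rebuilt from the rest.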