Paper: Active Zipfian Sampling for Statistical Parser Training

ACL ID N09-2063
Title Active Zipfian Sampling for Statistical Parser Training
Venue Human Language Technologies
Session Short Paper
Year 2009

Active learning has proven to be a successful strategy for the rapid development of corpora used to train statistical natural language parsers. The vast majority of studies in this field have focused on estimating the informativeness of samples; however, the representativeness of samples is another important criterion to consider in active learning. We present a novel metric for estimating the representativeness of sentences, based on a modification of Zipf’s Principle of Least Effort. Experiments on the WSJ corpus with a wide-coverage parser show that our method always performs at least as well as, and generally significantly better than, alternative representativeness-based methods.