Paper: Using Language Modeling to Select Useful Annotation Data

ACL ID N09-3005
Title Using Language Modeling to Select Useful Annotation Data
Venue HLT-NAACL Companion Volume: Student Research Workshop and Doctoral Consortium
Session
Year 2009
Authors

An annotation project typically has an abundant supply of unlabeled data that can be drawn from some corpus, but because the labeling process is expensive, it is helpful to pre-screen the pool of candidate instances based on some criterion of future usefulness. In many cases, that criterion is to increase the presence of rare classes in the data to be annotated. We propose a novel method for solving this problem and show that it compares favorably to a random sampling baseline and a clustering algorithm.