Paper: Mitigating The Paucity-Of-Data Problem: Exploring The Effect Of Training Corpus Size On Classifier Performance For Natural Language Processing

ACL ID H01-1052
Title Mitigating The Paucity-Of-Data Problem: Exploring The Effect Of Training Corpus Size On Classifier Performance For Natural Language Processing
Venue Human Language Technologies
Session Main Conference
Year 2001
Authors

In this paper, we discuss experiments applying machine learning techniques to the task of confusion set disambiguation, using three orders of magnitude more training data than has previously been used for any disambiguation-in-string-context problem. In an attempt to determine when current learning methods will cease to benefit from additional training data, we analyze residual errors made by learners when issues of sparse data have been significantly mitigated. Finally, in the context of our results, we discuss possible directions for the empirical natural language research community.

Keywords Learning curves, data scaling, very large corpora, natural language disambiguation.
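
To make the task named in the abstract concrete, the following is a minimal sketch of confusion set disambiguation measured as a learning curve. It is an illustration under stated assumptions, not the paper's experimental setup: the learner is a simple add-one-smoothed naive Bayes stand-in chosen for brevity, the confusion set {than, then} and the tiny corpus are invented for the example, and the function names (context_features, train, predict, learning_curve) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, i, window=2):
    """Neighbouring words within `window` positions of index i, tagged by offset."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.append((off, tokens[j]))
    return feats


def train(sentences, confusion_set):
    """Collect per-word counts of context features (naive Bayes statistics)."""
    label_counts = Counter()
    feat_counts = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in confusion_set:
                label_counts[tok] += 1
                feat_counts[tok].update(context_features(tokens, i))
    return label_counts, feat_counts


def predict(model, confusion_set, tokens, i):
    """Pick the confusion set member that best fits the context at position i."""
    label_counts, feat_counts = model
    total = sum(label_counts.values())
    vocab = {f for counts in feat_counts.values() for f in counts}
    best, best_score = None, -math.inf
    for word in sorted(confusion_set):  # sorted so ties break deterministically
        # Add-one smoothing on both the prior and the feature likelihoods.
        score = math.log((label_counts[word] + 1) / (total + len(confusion_set)))
        denom = sum(feat_counts[word].values()) + len(vocab) + 1
        for f in context_features(tokens, i):
            score += math.log((feat_counts[word][f] + 1) / denom)
        if score > best_score:
            best, best_score = word, score
    return best


def accuracy(model, confusion_set, sentences):
    """Fraction of confusion set occurrences that the model recovers."""
    correct = total = 0
    for sent in sentences:
        tokens = sent.lower().split()
        for i, tok in enumerate(tokens):
            if tok in confusion_set:
                total += 1
                correct += predict(model, confusion_set, tokens, i) == tok
    return correct / total if total else 0.0


def learning_curve(train_sents, test_sents, confusion_set, sizes):
    """Accuracy on test_sents after training on the first n sentences, per n."""
    return [
        (n, accuracy(train(train_sents[:n], confusion_set), confusion_set, test_sents))
        for n in sizes
    ]


# Toy data, invented purely for illustration.
CONFUSION_SET = {"than", "then"}
TRAIN = [
    "she is taller than her brother",
    "we ate dinner and then we left",
    "this is better than that one",
    "first we eat then we sleep",
]
TEST = ["he is older than me", "and then we went home"]

if __name__ == "__main__":
    for n, acc in learning_curve(TRAIN, TEST, CONFUSION_SET, [0, 2, 4]):
        print(f"{n} training sentences: accuracy {acc:.2f}")
```

Plotting accuracy against the number of training sentences, typically on a logarithmic axis once the corpus is large, gives the kind of learning curve listed among the keywords. With a corpus this small the numbers carry no meaning; the sketch only shows the shape of the procedure.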