Paper: Optimal Data Set Selection: An Application to Grapheme-to-Phoneme Conversion

ACL ID N13-1139
Title Optimal Data Set Selection: An Application to Grapheme-to-Phoneme Conversion
Venue Annual Conference of the North American Chapter of the Association for Computational Linguistics
Session Main Conference
Year 2013
Authors

In this paper we introduce the task of unlabeled, optimal, data set selection. Given a large pool of unlabeled examples, our goal is to select a small subset to label, which will yield a high performance supervised model over the entire data set. Our first proposed method, based on the rank-revealing QR matrix factorization, selects a subset of words which span the entire word-space effectively. For our second method, we develop the concept of feature coverage which we optimize with a greedy algorithm. We apply these methods to the task of grapheme-to-phoneme prediction. Experiments over a data set of 8 languages show that in all scenarios, our selection methods are effective at yielding a small, but optimal set of labelled examples. When fed into a state-of-the-art supervised mode...
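
The greedy feature-coverage idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the features or the coverage objective, so everything concrete here is an assumption made for illustration: words are represented by character bigrams with `#` boundary markers, coverage is the plain count of distinct bigrams seen so far, and the names `char_ngrams` and `greedy_coverage_selection` are hypothetical. The paper's actual objective may differ.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word`.

    Assumption: '#' marks the word boundaries; the abstract does not
    say which features the authors use.
    """
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def greedy_coverage_selection(words, budget, n=2):
    """Greedily select up to `budget` words to label.

    At each step, pick the word that adds the most n-grams not yet
    covered by the selected set. This is a simplified, unweighted
    notion of coverage chosen for illustration only.
    """
    features = {w: char_ngrams(w, n) for w in words}
    remaining = list(dict.fromkeys(words))  # drop duplicates, keep order
    covered = set()
    selected = []

    while remaining and len(selected) < budget:
        # Ties are broken in favor of the earliest word in the pool.
        best = max(remaining, key=lambda w: len(features[w] - covered))
        if not features[best] - covered:
            break  # no remaining word adds a new feature
        selected.append(best)
        covered |= features[best]
        remaining.remove(best)

    return selected, covered


if __name__ == "__main__":
    pool = ["cat", "cab", "dog", "cot"]
    chosen, covered = greedy_coverage_selection(pool, budget=2)
    print(chosen, len(covered))
```

In this sketch the selection needs only the unlabeled word list, which matches the setting described in the abstract: the chosen words would then be sent for labeling and used to train the supervised grapheme-to-phoneme model.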