Paper: Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples

ACL ID P09-1013
Title Knowing the Unseen: Estimating Vocabulary Size over Unseen Samples
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2009
Authors

Empirical studies on corpora involve mak- ing measurements of several quantities for the purpose of comparing corpora, creat- ing language models or to make general- izations about specific linguistic phenom- ena in a language. Quantities such as av- erage word length are stable across sam- ple sizes and hence can be reliably esti- mated from large enough samples. How- ever, quantities such as vocabulary size change with sample size. Thus measure- ments based on a given sample will need to be extrapolated to obtain their estimates over larger unseen samples. In this work, we propose a novel nonparametric estima- tor of vocabulary size. Our main result is to show the statistical consistency of the estimator – the first of its kind in the lit- erature. Finally, we compare our proposal with...