ACL ID P03-1037
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2003

It is well known that occurrence counts of words in documents are often modeled poorly by standard distributions like the binomial or Poisson. Observed counts vary more than simple models predict, prompting the use of overdispersed models like Gamma-Poisson or Beta-binomial mixtures as robust alternatives. Another deficiency of standard models is that most words never occur in a given document, resulting in large numbers of zero counts. We propose using zero-inflated models to deal with this, and evaluate competing models on a Naive Bayes text classification task. Simple zero-inflated models can account for practically relevant variation, and can be easier to work with than overdispersed models.
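
To illustrate the idea of zero inflation mentioned above, here is a minimal sketch of a zero-inflated Poisson distribution. The abstract does not give the paper's exact parametrization, so the mixture form used here (an extra point mass at zero with weight pi, mixed with an ordinary Poisson with rate lam) is an assumption, and the function names poisson_pmf, zip_pmf, and zip_mean are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing count k under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson (assumed form, not taken from the paper).

    With probability pi the count is forced to 0; otherwise it is
    drawn from Poisson(lam). Zeros can therefore come from either part.
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def zip_mean(lam, pi):
    """Mean of the zero-inflated Poisson above."""
    return (1.0 - pi) * lam


# Compare zero probabilities for two models with the same mean (1.0).
lam, pi = 2.0, 0.5
p0_plain = poisson_pmf(0, zip_mean(lam, pi))
p0_inflated = zip_pmf(0, lam, pi)
print(f"P(0) plain Poisson: {p0_plain:.4f}")
print(f"P(0) zero-inflated: {p0_inflated:.4f}")
```

With the mean held fixed, the zero-inflated variant assigns noticeably more probability to a count of zero than the plain Poisson does, which is the behavior the abstract motivates for word counts that are mostly zero.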