Paper: An Improved Hierarchical Bayesian Model of Language for Document Classification

ACL ID C08-1004
Title An Improved Hierarchical Bayesian Model of Language for Document Classification
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2008
Authors

This paper addresses the fundamental problem of document classification, focusing on problems in which the classes are mutually exclusive. We advocate an approximate sampling distribution for word counts in documents and demonstrate that the resulting model can outperform both the simple multinomial and more recently proposed extensions on the classification task. We also compare the classifiers to a linear SVM and show that, provided certain conditions are met, the new model achieves performance that exceeds that of the SVM and ranks amongst the very best published results on the Newsgroups classification task.