Paper: One-Class Clustering in the Text Domain

ACL ID D08-1005
Title One-Class Clustering in the Text Domain
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2008

Having seen a news title “Alba denies wedding reports”, how do we infer that it is primarily about Jessica Alba, rather than about weddings or reports? We probably realize that, in a randomly driven sentence, the word “Alba” is less anticipated than “wedding” or “reports”, which adds value to the word “Alba” if used. Such anticipation can be modeled as a ratio between an empirical probability of the word (in a given corpus) and its estimated probability in general English. Aggregated over all words in a document, this ratio may be used as a measure of the document’s topicality. Assuming that the corpus consists of on-topic and off-topic documents (we call them the core and the noise), our goal is to determine which documents belong to the core. We propose tw...
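
The ratio-based topicality measure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not say how the per-word ratios are aggregated, so the sketch assumes a simple mean over the document's tokens. The function names (empirical_probs, topicality), the floor fallback for words missing from the general-English table, and every probability value in general_p are hypothetical and chosen only for illustration.

```python
from collections import Counter


def empirical_probs(corpus_docs):
    """Relative frequency of each word over all tokens in the corpus."""
    counts = Counter(tok for doc in corpus_docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def topicality(doc, corpus_p, general_p, floor=1e-6):
    """Mean of corpus_p[w] / general_p[w] over the tokens of doc.

    ASSUMPTION: the mean is one possible aggregation; the abstract
    does not specify which one is used. `floor` is a hypothetical
    fallback probability for words absent from general_p.
    """
    if not doc:
        return 0.0
    ratios = [corpus_p.get(w, 0.0) / general_p.get(w, floor) for w in doc]
    return sum(ratios) / len(ratios)


# Toy corpus: two documents about Alba (the core), one off-topic (the noise).
corpus = [
    ["alba", "denies", "wedding", "reports"],
    ["alba", "stars", "in", "new", "film"],
    ["weather", "reports", "predict", "rain"],
]

# Made-up general-English probabilities, not taken from any real resource.
general_p = {
    "alba": 1e-5, "denies": 5e-4, "wedding": 1e-3, "reports": 1e-3,
    "stars": 5e-4, "in": 2e-2, "new": 5e-3, "film": 1e-3,
    "weather": 1e-3, "predict": 5e-4, "rain": 1e-3,
}

corpus_p = empirical_probs(corpus)
scores = [topicality(doc, corpus_p, general_p) for doc in corpus]

for doc, score in zip(corpus, scores):
    print(" ".join(doc), round(score, 1))
```

In this toy setup the word "alba" is frequent in the corpus but rare in the assumed general-English table, so it dominates the score of the two documents that contain it, and those documents rank above the weather document. How such scores would be turned into a core/noise decision is not covered by this sketch.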