Paper: A Hierarchical Model of Web Summaries

ACL ID P11-2118
Title A Hierarchical Model of Web Summaries
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011

We investigate the relevance of hierarchical topic models for representing the content of Web gists. We focus on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually curated hierarchy of categories. Our first approach is information-theoretic and uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and is derived from a more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in perplexity on held-out data.
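To make the evaluation measure concrete, the following is a minimal sketch of how the perplexity of a flat 1-gram model on held-out text can be computed. It is not the authors' implementation: the function names, the add-alpha smoothing, and the toy token lists are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Estimate a flat 1-gram model with add-alpha smoothing (an assumed choice)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def perplexity(model, tokens):
    """Perplexity = exp of the negative mean log-probability per token."""
    log_prob = sum(math.log(model[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))


# Toy data standing in for training and held-out gists (hypothetical).
train = "web directory of web sites".split()
held_out = "directory of sites".split()
vocab = set(train) | set(held_out)

model = train_unigram(train, vocab)
ppl = perplexity(model, held_out)
print(f"held-out perplexity: {ppl:.2f}")
```

Lower perplexity means the model assigns higher probability to the held-out text, so a hierarchical model improves on this baseline when its held-out perplexity is smaller.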