Paper: Language Model-Based Document Clustering Using Random Walks

ACL ID N06-1061
Title Language Model-Based Document Clustering Using Random Walks
Venue Human Language Technologies
Session Main Conference
Year 2006

We propose a new document vector represen- tation specifically designed for the document clustering task. Instead of the traditional term- based vectors, a document is represented as an a0-dimensional vector, where a0 is the number of documents in the cluster. The value at each di- mension of the vector is closely related to the generation probability based on the language model of the corresponding document. In- spired by the recent graph-based NLP methods, we reinforce the generation probabilities by it- erating random walks on the underlying graph representation. Experiments with k-means and hierarchical clustering algorithms show signif- icant improvements over the alternative a1a2 a3a4a5a2 vector representation.