Paper: Discovering Diverse and Salient Threads in Document Collections

ACL ID D12-1065
Title Discovering Diverse and Salient Threads in Document Collections
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2012
Authors

We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads: singly-linked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human...