Paper: Cross-Dataset Clustering: Revealing Corresponding Themes Across Multiple Corpora

ACL ID W02-2009
Title Cross-Dataset Clustering: Revealing Corresponding Themes Across Multiple Corpora
Venue International Conference on Computational Natural Language Learning
Session Main Conference
Year 2002
Authors
  • Ido Dagan (Bar Ilan University, Ramat Gan, Israel)
  • Zvika Marx (Hebrew University of Jerusalem, Jerusalem, Israel; Bar Ilan University, Ramat Gan, Israel)
  • Eli Shamir (Hebrew University of Jerusalem, Jerusalem, Israel)

We present a method for identifying corresponding themes across several corpora that focus on related but distinct domains. We approach this task through simultaneous clustering of keyword sets extracted from the analyzed corpora. Our algorithm extends the information-bottleneck soft clustering method to a setting that comprises several datasets. Experiments with topical corpora reveal similar aspects of three distinct religions. We evaluate the results by comparing them to clusters constructed manually by an expert.
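
For orientation, the following is a minimal sketch of the standard single-dataset information-bottleneck soft clustering iteration on which the abstract says the algorithm builds. It is not the cross-dataset extension itself, whose details the abstract does not give. The function names (`kl`, `ib_soft_cluster`), the parameter `beta`, and the four-keyword toy distribution are illustrative assumptions, not taken from the paper.

```python
import math

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def ib_soft_cluster(p_x, p_y_given_x, init_assign, beta, n_iter=50):
    """Iterate the standard information-bottleneck self-consistent updates.

    p_x[x]            : prior probability of element (e.g. keyword) x
    p_y_given_x[x][y] : feature distribution of element x
    init_assign[x][c] : initial soft assignment p(c|x)
    beta              : trade-off parameter; larger values give harder clusters
    Returns the final soft assignments p(c|x) as a list of rows.
    """
    n, k, m = len(p_x), len(init_assign[0]), len(p_y_given_x[0])
    q = [row[:] for row in init_assign]
    for _ in range(n_iter):
        # Cluster priors: p(c) = sum_x p(x) p(c|x)
        p_c = [sum(p_x[x] * q[x][c] for x in range(n)) for c in range(k)]
        # Cluster centroids: p(y|c) = sum_x p(x) p(c|x) p(y|x) / p(c)
        p_y_c = [
            [sum(p_x[x] * q[x][c] * p_y_given_x[x][y] for x in range(n))
             / max(p_c[c], 1e-12) for y in range(m)]
            for c in range(k)
        ]
        # Reassignment: p(c|x) proportional to p(c) exp(-beta KL(p(y|x) || p(y|c)))
        for x in range(n):
            logits = [math.log(max(p_c[c], 1e-12)) - beta * kl(p_y_given_x[x], p_y_c[c])
                      for c in range(k)]
            top = max(logits)  # subtract the maximum for numerical stability
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            q[x] = [w / total for w in weights]
    return q

# Toy illustration (made-up numbers): four keywords, two features, two clusters.
toy_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_prior = [0.25, 0.25, 0.25, 0.25]
toy_init = [[0.6, 0.4], [0.55, 0.45], [0.4, 0.6], [0.45, 0.55]]
assignments = ib_soft_cluster(toy_prior, toy_features, toy_init, beta=10.0)
```

In this toy run the first two keywords, whose feature distributions are similar, drift toward one cluster and the last two toward the other. A cross-dataset variant would additionally have to encourage each cluster to draw elements from every corpus; how the paper achieves this is not stated in the abstract and is not modeled here.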