ACL ID N10-1070
Title Term Weighting Schemes for Latent Dirichlet Allocation
Venue Human Language Technologies
Session Main Conference
Year 2010

Many implementations of Latent Dirichlet Allocation (LDA), including those described in Blei et al. (2003), rely at some point on the removal of stopwords, words which are assumed to contribute little to the meaning of the text. This step is considered necessary because otherwise high-frequency words tend to end up scattered across many of the latent topics without much rhyme or reason. We show, however, that the ‘problem’ of high-frequency words can be dealt with more elegantly, and in a way that to our knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI). Our proposed weighting methods not only make theoretical sense, but can also be shown to improve precision significantly on a non-trivial cross-language retrieval ta...