Paper: A Latent Variable Model for Geographic Lexical Variation

ACL ID D10-1124
Title A Latent Variable Model for Geographic Lexical Variation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2010

The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.