Paper: Learning Polylingual Topic Models from Code-Switched Social Media Documents

ACL ID P14-2110
Title Learning Polylingual Topic Models from Code-Switched Social Media Documents
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014
Authors

Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language-specific topic distributions based on code-switched documents to facilitate multilingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA, and learns semantically coherent aligned topics as judged by human annotators.
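
The abstract describes csLDA as inferring language-specific topic distributions from code-switched documents so that topics remain aligned across languages. The sketch below illustrates one generative process consistent with that description: a per-document topic mixture, a per-document language mixture, and per-language topic-word distributions that share topic indices across languages. The specific variables, priors, toy vocabularies, and the per-token language assignment are assumptions for illustration, not details given in the abstract.

# Minimal generative sketch (not the authors' exact model): topic k is "aligned"
# across languages because every language has its own word distribution phi[l][k]
# for the same topic index k. All hyperparameter values below are assumed.
import numpy as np

rng = np.random.default_rng(0)

K = 4                      # number of aligned topics (assumed)
langs = ["en", "es"]       # English-Spanish, as in the Twitter corpus
vocab = {"en": ["game", "vote", "music", "food"],
         "es": ["partido", "voto", "musica", "comida"]}  # toy vocabularies

alpha, beta, gamma = 0.5, 0.1, 0.5  # Dirichlet priors (assumed values)

# Language-specific topic-word distributions phi[l][k], one row per topic.
phi = {l: rng.dirichlet(beta * np.ones(len(vocab[l])), size=K) for l in langs}

def generate_document(n_tokens=10):
    """Generate one code-switched document under the sketched model."""
    theta = rng.dirichlet(alpha * np.ones(K))         # per-document topic mixture
    psi = rng.dirichlet(gamma * np.ones(len(langs)))  # per-document language mixture
    tokens = []
    for _ in range(n_tokens):
        z = rng.choice(K, p=theta)                    # topic assignment
        l = langs[rng.choice(len(langs), p=psi)]      # language assignment (code-switch)
        w = rng.choice(vocab[l], p=phi[l][z])         # word from language-specific topic
        tokens.append((l, w))
    return tokens

print(generate_document())

Under this sketch, inference would recover theta, psi, and phi from observed code-switched documents; the abstract's perplexity comparison against LDA corresponds to holding out documents and evaluating the learned distributions, though the exact evaluation setup is described in the paper itself, not here.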