Paper: Discovering Sociolinguistic Associations with Structured Sparsity

ACL ID P11-1137
Title Discovering Sociolinguistic Associations with Structured Sparsity
Venue Annual Meeting of the Association of Computational Linguistics
Session Main Conference
Year 2011

We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate de- mographic statistics about the authors’ geo- graphic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a com- posite lscript1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict de- mographic attributes; our method identifies a compact set of words that are strongly asso- ciated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small num- ber of features, which corr...