Paper: A Broad-Coverage Normalization System for Social Media Language

ACL ID P12-1109
Title A Broad-Coverage Normalization System for Social Media Language
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2012
Authors

Social media language contains a huge number and wide variety of nonstandard tokens, created both intentionally and unintentionally by users. It is crucial to normalize these noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing nonstandard tokens, including enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated at both the word and message level using four SMS and Twitter data sets. R...
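
The coverage criterion in the abstract (the correct word appearing within the top n output candidates) can be illustrated with a minimal sketch. The function name `top_n_coverage`, the dictionary layout, and the toy tokens and candidate lists below are assumptions made for illustration; they are not taken from the paper and do not reproduce its system or its results.

```python
from typing import Dict, List


def top_n_coverage(ranked: Dict[str, List[str]], gold: Dict[str, str], n: int) -> float:
    """Fraction of nonstandard tokens whose gold word is among the top n candidates.

    `ranked` maps each nonstandard token to its candidate list, best first.
    `gold` maps each nonstandard token to its correct standard word.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for token, word in gold.items()
        if word in ranked.get(token, [])[:n]
    )
    return hits / len(gold)


# Toy data, invented for illustration only (hypothetical candidate rankings).
ranked = {
    "2moro": ["tomorrow", "tomato"],
    "b4": ["be", "before"],
    "gr8": ["grate", "great", "greet"],
}
gold = {"2moro": "tomorrow", "b4": "before", "gr8": "great"}

coverage_at_1 = top_n_coverage(ranked, gold, 1)  # only "2moro" is correct at rank 1
coverage_at_2 = top_n_coverage(ranked, gold, 2)  # all three gold words are in the top 2
```

In this toy setup, coverage rises as n grows because a correct word ranked second still counts as covered. This is why a top-n criterion is more forgiving than requiring the first candidate to be correct.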