Paper: Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

ACL ID P11-1038
Title Lexical Normalisation of Short Text Messages: Makn Sens a #twitter
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011
Authors

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn’t require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
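
The three stages named in the abstract (detect, generate candidates, select) can be illustrated with a minimal toy sketch. This is not the authors' implementation: a plain dictionary lookup stands in for the classifier, `difflib` string similarity over spellings and a crude consonant skeleton stands in for morphophonemic similarity, and a small bigram table stands in for the context model. All names (`VOCAB`, `BIGRAMS`, `phonetic_key`, `normalise`), the data, the 0.5 threshold, and the 0.1 context weight are assumptions made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon and bigram counts; not taken from the paper.
VOCAB = {"making", "sense", "of", "see", "you", "tomorrow", "to", "tomato"}
BIGRAMS = {("see", "you"): 8, ("you", "tomorrow"): 5, ("making", "sense"): 6}


def phonetic_key(word):
    """Crude consonant skeleton: drop vowels and collapse repeated letters."""
    key = []
    for ch in word.lower():
        if ch in "aeiou":
            continue
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)


def is_ill_formed(token):
    """Stand-in for the classifier: flag alphabetic out-of-vocabulary tokens."""
    return token.isalpha() and token.lower() not in VOCAB


def generate_candidates(token, min_ratio=0.5):
    """Return (word, similarity) pairs close to the token in spelling or sound."""
    tok = token.lower()
    candidates = []
    for word in VOCAB:
        spelling = SequenceMatcher(None, tok, word).ratio()
        sound = SequenceMatcher(None, phonetic_key(tok), phonetic_key(word)).ratio()
        similarity = max(spelling, sound)
        if similarity >= min_ratio:
            candidates.append((word, similarity))
    return candidates


def normalise(tokens, context_weight=0.1):
    """Replace each flagged token with its best-scoring candidate, if any."""
    result = []
    for token in tokens:
        if not is_ill_formed(token):
            result.append(token)
            continue
        prev = result[-1].lower() if result else None
        best, best_score = token, 0.0
        for word, similarity in generate_candidates(token):
            # Combine word similarity with a left-context bigram count.
            score = similarity + context_weight * BIGRAMS.get((prev, word), 0)
            if score > best_score:
                best, best_score = word, score
        result.append(best)
    return result
```

With these toy tables, `normalise(["see", "you", "tmrw"])` returns `["see", "you", "tomorrow"]`, while a non-alphabetic token such as `#twitter` is passed through unchanged because the stand-in detector only flags alphabetic tokens.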