Paper: Normalizing tweets with edit scripts and recurrent neural embeddings

ACL ID P14-2111
Title Normalizing tweets with edit scripts and recurrent neural embeddings
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014
Authors

Tweets often contain a large proportion of abbreviations, alternative spellings, novel words and other non-canonical language. These features are problematic for standard language analysis tools, and it can be desirable to convert them to canonical form. We propose a novel text normalization model based on learning edit operations from labeled data while incorporating features induced from unlabeled data via character-level neural text embeddings. The text embeddings are generated using a Simple Recurrent Network. We find that enriching the feature set with text embeddings substantially lowers word error rates on an English tweet normalization dataset. Our model improves on the state of the art with little training data and without any lexical resources.
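To make the idea of an edit script concrete, the following is a minimal sketch of how a noisy string and its canonical form could be turned into a sequence of character-level edit operations, and how such a sequence can be applied back to the input. It is an illustration only, not the paper's implementation: the operation names (KEEP, DEL, INS), the function names edit_script and apply_script, and the use of Python's difflib for alignment are all assumptions made here. In a learned model, a sequence labeler would predict these operations from the input characters instead of reading them off a known target.

```python
from difflib import SequenceMatcher


def edit_script(source, target):
    """Return a list of (op, payload) pairs that turn source into target.

    Hypothetical operation inventory (an assumption, not from the paper):
      ("KEEP", ch)   copy one source character
      ("DEL", ch)    drop one source character
      ("INS", text)  emit text without consuming any source character
    """
    script = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            script.extend(("KEEP", ch) for ch in source[i1:i2])
        elif tag == "delete":
            script.extend(("DEL", ch) for ch in source[i1:i2])
        elif tag == "insert":
            script.append(("INS", target[j1:j2]))
        else:
            # "replace": delete the source span, then insert the target span.
            script.extend(("DEL", ch) for ch in source[i1:i2])
            script.append(("INS", target[j1:j2]))
    return script


def apply_script(source, script):
    """Apply an edit script produced by edit_script to source."""
    out = []
    pos = 0  # index of the next unconsumed source character
    for op, payload in script:
        if op == "KEEP":
            out.append(source[pos])
            pos += 1
        elif op == "DEL":
            pos += 1
        elif op == "INS":
            out.append(payload)
        else:
            raise ValueError(f"unknown operation: {op}")
    return "".join(out)


if __name__ == "__main__":
    noisy, canonical = "c u 2moro", "see you tomorrow"
    script = edit_script(noisy, canonical)
    for op in script:
        print(op)
    print(apply_script(noisy, script))
```

Applying the script returned for a pair always reproduces the target string, whatever alignment difflib chooses. The alignment itself is not guaranteed to be a minimum edit distance alignment, so a system that needs consistent training labels might prefer an explicit Levenshtein backtrace.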