Paper: Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

ACL ID P11-2013
Title Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011
Authors

Most text message normalization approaches are based on supervised learning and rely on human-labeled training data. In addition, the nonstandard words are often categorized into different types, and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time-consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed charac...
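The letter-level generation process described in the abstract can be illustrated with a minimal sketch. It assumes one label per dictionary letter, a hypothetical `DELETE` marker for removed letters, and hand-written label sequences; the function name, the label encoding, and the example pairs are illustrative assumptions, not the paper's actual model or data, and no learned sequence labeler is involved.

```python
from typing import List

# Hypothetical marker meaning "remove this letter" (an assumption of this sketch).
DELETE = "-"


def apply_labels(word: str, labels: List[str]) -> str:
    """Generate a nonstandard token from a dictionary word.

    Each letter carries exactly one label:
      - a label equal to the letter keeps it (retained),
      - DELETE drops it (removed),
      - any other string replaces it (substituted by letters/digits).
    """
    if len(word) != len(labels):
        raise ValueError("need exactly one label per letter")
    out = []
    for letter, label in zip(word, labels):
        if label == DELETE:
            continue  # letter is removed
        out.append(label)  # retained if label == letter, otherwise substituted
    return "".join(out)


# Illustrative pairs; the label sequences are written by hand here.
print(apply_labels("tomorrow", ["2", "-", "m", "o", "r", "-", "o", "-"]))
print(apply_labels("because", ["b", "-", "c", "-", "u", "z", "-"]))
```

In a sequence labeling setup, a trained model would predict the label for each letter rather than taking the labels as input; this sketch only shows how a given label sequence maps a dictionary word to a nonstandard token.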