Paper: A Unified Tagging Approach to Text Normalization

This paper addresses the issue of text normalization, an important yet often overlooked problem in natural language processing. By text normalization, we mean converting ‘informally inputted’ text into its canonical form by eliminating ‘noise’ in the text and detecting paragraph and sentence boundaries. Previously, text normalization issues were often handled in an ad hoc fashion or studied separately. This paper first gives a formalization of the entire problem. It then proposes a unified tagging approach to perform the task using Conditional Random Fields (CRF). The paper shows that with the introduction of a small set of tags, most of the text normalization tasks can be performed within the approach. The accuracy of the proposed method is high,...
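
To make the tagging formulation concrete, the following is a minimal sketch of how a sequence of per-token tags could be turned back into normalized text. The tag names (`KEEP`, `DEL`, `SENT_END`, `PARA_END`) and the function `apply_tags` are hypothetical and chosen only for illustration; they are not the tag set defined in the paper. The tags here are assigned by hand, whereas in the proposed approach a CRF model would predict them.

```python
from typing import List, Tuple

# Hypothetical tag set, for illustration only (not the paper's actual tags).
KEEP = "KEEP"          # keep the token unchanged
DELETE = "DEL"         # drop a noisy token
SENT_END = "SENT_END"  # keep the token; a sentence ends after it
PARA_END = "PARA_END"  # keep the token; a paragraph ends after it


def apply_tags(tagged: List[Tuple[str, str]]) -> str:
    """Rebuild text from (token, tag) pairs.

    Sentences within a paragraph are joined by a space,
    paragraphs are separated by a blank line.
    """
    paragraphs: List[List[str]] = [[]]  # each paragraph is a list of sentences
    current: List[str] = []             # tokens of the sentence being built
    for token, tag in tagged:
        if tag == DELETE:
            continue                    # noise: skip the token entirely
        current.append(token)
        if tag in (SENT_END, PARA_END):
            paragraphs[-1].append(" ".join(current))
            current = []
        if tag == PARA_END:
            paragraphs.append([])       # start a new paragraph
    if current:                         # flush a trailing unfinished sentence
        paragraphs[-1].append(" ".join(current))
    return "\n\n".join(" ".join(p) for p in paragraphs if p)


example = [
    ("hello", KEEP), ("!!!!", DELETE), ("world.", SENT_END),
    ("Bye.", PARA_END), ("Next", KEEP), ("para.", PARA_END),
]
normalized = apply_tags(example)
```

The point of the sketch is the design choice the abstract describes: noise removal and boundary detection become a single labeling problem over tokens, so one sequence model can handle both instead of separate ad hoc steps.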