Paper: Elephant: Sequence Labeling for Word and Sentence Segmentation

ACL ID D13-1146
Title Elephant: Sequence Labeling for Word and Sentence Segmentation
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2013
Authors

Tokenization is widely regarded as a solved problem due to the high accuracy that rule-based tokenizers achieve. But rule-based tokenizers are hard to maintain and their rules language specific. We show that high-accuracy word and sentence segmentation can be achieved by using supervised sequence labeling on the character level combined with unsupervised feature learning. We evaluated our method on three languages and obtained error rates of 0.27 ‰ (English), 0.35 ‰ (Dutch) and 0.76 ‰ (Italian) for our best models.

1 An Elephant in the Room

Tokenization, the task of segmenting a text into words and sentences, is often regarded as a solved problem in natural language processing (Dridan and Oepen, 2012), probably because many corpora are already in tokenized format. But like an ele...
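
As a rough illustration of what "sequence labeling on the character level" can look like, the sketch below turns a gold segmentation into one label per character and decodes such labels back into sentences of tokens. The four-label scheme (S = first character of a sentence, T = first character of any other token, I = inside a token, O = outside any token) is an assumption made for this example, since the excerpt above does not spell out the label set. The function names and the sample text are likewise hypothetical. No classifier is trained here; the sketch only shows the encoding that a supervised labeler would be trained to predict.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def label_characters(text: str, sentences: List[List[Span]]) -> List[str]:
    """Encode a gold segmentation as one label per character.

    Assumed label set (illustrative only): S, T, I, O.
    """
    labels = ["O"] * len(text)  # default: outside any token, e.g. whitespace
    for sentence in sentences:
        for token_index, (start, end) in enumerate(sentence):
            # The first token of a sentence opens with S, every other token with T.
            labels[start] = "S" if token_index == 0 else "T"
            for i in range(start + 1, end):
                labels[i] = "I"
    return labels


def decode(text: str, labels: List[str]) -> List[List[str]]:
    """Rebuild sentences (lists of token strings) from per-character labels."""
    sentences: List[List[str]] = []
    in_token = False
    for ch, label in zip(text, labels):
        if label == "O":
            in_token = False
            continue
        if label == "S" or not sentences:
            sentences.append([])  # open a new sentence
            in_token = False
        if label == "I" and in_token:
            sentences[-1][-1] += ch  # continue the current token
        else:
            sentences[-1].append(ch)  # start a new token
            in_token = True
    return sentences


# Hypothetical example: "didn't" is split into "did" + "n't" with no space between them,
# which a whitespace-only splitter could not express but character labels can.
TEXT = "It didn't matter. Fine."
GOLD = [
    [(0, 2), (3, 6), (6, 9), (10, 16), (16, 17)],
    [(18, 22), (22, 23)],
]

LABELS = label_characters(TEXT, GOLD)
TOKENS = decode(TEXT, LABELS)
```

Because every character receives exactly one label, segmentation decisions that rule-based tokenizers handle with special cases (clitics, sentence-final periods) become ordinary per-character classification targets under this encoding.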