Paper: Tokenization: Returning to a Long Solved Problem — A Survey, Contrastive Experiment, Recommendations, and Toolkit —

ACL ID P12-2074
Title Tokenization: Returning to a Long Solved Problem — A Survey, Contrastive Experiment, Recommendations, and Toolkit —
Venue Annual Meeting of the Association for Computational Linguistics
Session Short Paper
Year 2012
Authors

We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).
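
To make the idea of stand-off pointers concrete, the following is a minimal sketch of a rule-based tokenizer that records, for every token, its character offsets in the untouched input. It is not the toolkit described in the abstract: the `Token` type, the `tokenize` function, and the tiny rule set in `TOKEN_RE` are hypothetical and cover only a few Penn Treebank-style conventions (splitting off `n't` and common clitics, and separating punctuation).

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str   # token string, copied verbatim from the input
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character in the original text


# Hypothetical, heavily simplified rules (an assumption for illustration).
# Alternatives are tried left to right at each position.
TOKEN_RE = re.compile(
    r"""
    \w+(?=n't\b)              # word stem directly before a negative contraction
  | n't\b                     # the negative contraction itself
  | '(?:s|re|ve|ll|d|m)\b     # common clitics split off from their host word
  | \w+                       # ordinary words and numbers
  | [^\w\s]                   # any other single non-space character
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize(text: str) -> List[Token]:
    """Split text into tokens while keeping offsets into the original string."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    text = "Don't panic, it's fine."
    for tok in tokenize(text):
        # Each token can be recovered from the original text via its offsets.
        print(tok.form, tok.start, tok.end, repr(text[tok.start:tok.end]))
```

Because the input string is never modified, each token's `(start, end)` pair can be used to slice the original text and recover the token exactly. Rules that rewrite characters (for example, normalizing quotation marks) would need additional offset bookkeeping, which this sketch does not attempt.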