Paper: A Preliminary Look Into The Use Of Named Entity Information For Bioscience Text Tokenization

ACL ID N04-2007
Title A Preliminary Look Into The Use Of Named Entity Information For Bioscience Text Tokenization
Venue Human Language Technologies
Session Student Session
Year 2004
Authors

Tokenization in the bioscience domain is often difficult. New terms, technical terminology, and nonstandard orthography, all common in bioscience text, contribute to this difficulty. This paper introduces the tasks of tokenization and normalization and then presents BAccHANT, a system built for bioscience text normalization. Casting tokenization and normalization as a problem of punctuation classification motivates the use of machine learning methods in implementing this system. The evaluation of BAccHANT included an error analysis of its performance inside and outside named entities (NEs) from the GENIA corpus, which led to the creation of BAccHANT-N, a normalization system trained solely on data from inside NEs. Evaluation of this new system indicated that norm...
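
To make the punctuation-classification framing concrete, the following is a minimal sketch and not BAccHANT's actual implementation. It assumes that each punctuation character in a string becomes one classification instance described by its local context, and that a classifier's decision for that character is then applied to produce normalized text. The function names, the feature set, and the label names ("keep", "space", "remove") are hypothetical and are not taken from the paper.

```python
import re

# Punctuation characters treated as classification candidates (an assumption).
PUNCT = re.compile(r"[-/().,]")


def punctuation_instances(text, window=2):
    """Return one feature dict per punctuation character in `text`.

    Each dict describes the character and a small window of context,
    which a machine learning classifier could use as input features.
    """
    instances = []
    for match in PUNCT.finditer(text):
        i = match.start()
        left = text[max(0, i - window):i]
        right = text[i + 1:i + 1 + window]
        instances.append({
            "index": i,
            "char": match.group(),
            "left": left,
            "right": right,
            "left_is_digit": left[-1:].isdigit(),
            "right_is_digit": right[:1].isdigit(),
        })
    return instances


def apply_decisions(text, decisions):
    """Apply per-character labels to `text`.

    `decisions` maps a character index to a hypothetical label:
    "keep" leaves the character, "space" replaces it with a space,
    and "remove" deletes it. Unlisted indices are kept unchanged.
    """
    out = []
    for i, ch in enumerate(text):
        label = decisions.get(i, "keep")
        if label == "space":
            out.append(" ")
        elif label == "keep":
            out.append(ch)
        # label == "remove": emit nothing
    return "".join(out)
```

In this sketch, the string "IL-2 binds NF-kappa B." yields three instances (two hyphens and the final period). A trained classifier would supply the labels; here they would be passed in by hand, for example apply_decisions("NF-kappa B", {2: "space"}).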