Paper: Tokenization As The Initial Phase In NLP

ACL ID C92-4173
Title Tokenization As The Initial Phase In NLP
Venue International Conference on Computational Linguistics
Session Main Conference
Year 1992

In this paper, the authors address the significance and complexity of tokenization, the initial step of NLP. The notions of word and token are discussed and defined from the viewpoints of lexicography and pragmatic implementation, respectively. Automatic segmentation of Chinese words is presented as an illustration of tokenization. Practical approaches to the identification of compound tokens in English, such as idioms, phrasal verbs and fixed expressions, are developed.
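
The abstract does not say how compound tokens are identified, so the following is only an illustrative sketch, not the authors' method. It assumes a small hand-written lexicon of multiword expressions and uses greedy longest-match lookup, a common baseline for both compound-token detection and dictionary-based word segmentation. The names `build_lexicon`, `merge_compounds`, and `LEXICON` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def build_lexicon(entries: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Store each multiword expression as a tuple of lowercased words."""
    return {tuple(entry.lower().split()) for entry in entries}


def merge_compounds(words: List[str], lexicon: Set[Tuple[str, ...]]) -> List[str]:
    """Greedy longest-match: join word runs that appear in the lexicon.

    Assumption: the input is already split on whitespace, and a compound
    is recognized only when its words are contiguous and match exactly
    (ignoring case). Inflected or discontinuous forms are not handled.
    """
    max_len = max((len(entry) for entry in lexicon), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two-word candidates.
        for n in range(min(max_len, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in lexicon:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No compound starts here; emit the single word as a token.
            tokens.append(words[i])
            i += 1
    return tokens


# Hypothetical lexicon: an idiom, a phrasal verb, and a fixed expression.
LEXICON = build_lexicon(["kick the bucket", "give up", "by and large"])

if __name__ == "__main__":
    sentence = "He will not give up by and large"
    print(merge_compounds(sentence.split(), LEXICON))
    # → ['He', 'will', 'not', 'give up', 'by and large']
```

The same longest-match loop applies to unsegmented text such as Chinese if the input list holds characters instead of words and matched runs are joined without a space, although real segmenters must also resolve ambiguous matches, which this sketch ignores.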