Paper: Improving Chinese Tokenization With Linguistic Filters On Statistical Lexical Acquisition

ACL ID A94-1030
Title Improving Chinese Tokenization With Linguistic Filters On Statistical Lexical Acquisition
Venue Applied Natural Language Processing Conference
Session Main Conference
Year 1994
Authors
  • Dekai Wu (Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong)
  • Pascale Fung (Columbia University, New York, NY)

The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993; Chiang et al. 1992; Lin et al. 1993; Wu & Tseng 1993; Sproat et al. 1994). We present empirical evidence for four points concerning tokenization of Chinese text: (1) More rigorous "blind" evaluation methodology is needed to avoid inflated accuracy measurements; we introduce the nk-blind method. (2) The extent of the unknown-word problem is far more serious than generally thought, when tokenizing unrestricted texts in realistic domains. (3) S...
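
The abstract's point about dictionary-lookup tokenizers and their failure on out-of-lexicon words can be made concrete with a generic greedy maximum-matching segmenter. This is a minimal sketch of that general family of methods, not the paper's own approach; the toy lexicon and sample sentence are hypothetical.

```python
# Generic greedy forward maximum-matching segmentation, illustrating the
# dictionary-lookup style of tokenizer the abstract refers to. NOT the
# paper's method; lexicon and sentence below are hypothetical stand-ins.

def max_match(text, lexicon, max_word_len=6):
    """Segment `text` left to right, preferring the longest lexicon match.

    Characters covered by no lexicon entry fall back to single-character
    tokens -- the unknown-word failure mode the abstract highlights.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:          # unknown character: emit it alone
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy lexicon and sentence ("computational linguistics is very interesting").
    lexicon = {"计算", "计算语言学", "语言", "语言学", "很", "有趣"}
    print(max_match("计算语言学很有趣", lexicon))
    # -> ['计算语言学', '很', '有趣']
```

With a word missing from the lexicon, the same routine degrades to a string of single-character tokens for that span, which is the kind of accuracy loss on unrestricted text that motivates the statistical lexical acquisition and linguistic filtering studied in the paper.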