Paper: Unknown Word Extraction For Chinese Documents

Venue: International Conference on Computational Linguistics
Session: Main Conference
Year: 2002

Chinese text has no blanks to mark word boundaries. As a result, identifying words is difficult because of segmentation ambiguities and occurrences of unknown words. Conventionally, unknown words have been extracted by statistical methods, which are simple and efficient. However, statistical methods that use no linguistic knowledge suffer from low precision and low recall: character strings with statistical significance may be phrases or partial phrases rather than words, and low-frequency new words are hard to identify statistically. In addition to statistical information, we try to use as much other information as possible, such as morphology, syntax, semantics, and world knowledge. The identification system fully utilizes the context a...