Paper: Automatic Extraction Of New Words From Japanese Texts Using Generalized Forward-Backward Search

ACL ID W96-0205
Title Automatic Extraction Of New Words From Japanese Texts Using Generalized Forward-Backward Search
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 1996
Authors
  • Masaaki Nagata (NTT Information and Communication Systems Laboratories, Yokosuka Japan)

We present a novel method for extracting new words from Japanese texts based on expected word frequencies. First, we compute expected word frequencies from Japanese texts using a robust stochastic N-best word segmenter. We then extract new words by filtering out erroneous word hypotheses whose expected word frequencies are lower than a predefined threshold. The method is derived from an approximation of the generalized version of the Forward-Backward algorithm. When the Japanese word segmenter is trained on a 4.7-million-word segmented corpus and tested on 1000 sentences whose out-of-vocabulary rate is 2.1%, the new word extraction method achieves 43.7% recall and 52.3% precision.
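
The two steps described in the abstract (accumulating expected word frequencies from N-best segmentations, then thresholding) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each sentence comes with a list of N-best segmentation hypotheses and their scores, and that renormalizing those scores over the N-best list is an acceptable stand-in for the Forward-Backward approximation mentioned above. The function names, the input layout, and the threshold value are all hypothetical.

```python
from collections import defaultdict


def expected_word_frequencies(nbest_per_sentence):
    """Accumulate expected word counts over a corpus.

    nbest_per_sentence: one entry per sentence, each a list of
    (words, score) pairs, where `words` is one segmentation hypothesis
    and `score` is its (unnormalized) probability.
    ASSUMPTION: scores are renormalized over the N-best list, so each
    hypothesis contributes its approximate posterior to every word in it.
    """
    expected = defaultdict(float)
    for hypotheses in nbest_per_sentence:
        total = sum(score for _, score in hypotheses)
        if total <= 0:
            continue  # no usable hypothesis for this sentence
        for words, score in hypotheses:
            posterior = score / total
            for word in words:
                expected[word] += posterior
    return dict(expected)


def extract_new_words(expected, known_words, threshold):
    """Keep out-of-dictionary hypotheses whose expected frequency
    reaches the threshold; lower-frequency hypotheses are filtered out."""
    return {
        word: freq
        for word, freq in expected.items()
        if word not in known_words and freq >= threshold
    }


if __name__ == "__main__":
    # Toy data with placeholder tokens, not real segmenter output.
    nbest = [
        [(["a", "bc"], 0.6), (["ab", "c"], 0.4)],
        [(["bc", "d"], 0.5), (["b", "cd"], 0.5)],
    ]
    freqs = expected_word_frequencies(nbest)
    print(extract_new_words(freqs, known_words={"a", "b", "c", "d"}, threshold=0.5))
```

In this toy run the hypothesis `ab` is supported by only one low-probability segmentation, so its expected frequency falls below the threshold and it is discarded, while `bc` accumulates evidence across both sentences and is kept.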