Paper: Non-parametric Bayesian Segmentation of Japanese Noun Phrases

ACL ID D11-1056
Title Non-parametric Bayesian Segmentation of Japanese Noun Phrases
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011

A key factor of high quality word segmenta- tion for Japanese is a high-coverage dictio- nary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation cri- teria. To supplement a morphological dictio- nary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For in- ference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a lo- cal optimum that is not too distant f...