Paper: Investigating The Relationship Between Word Segmentation Performance And Retrieval Performance In Chinese IR

ACL ID C02-1148
Title Investigating The Relationship Between Word Segmentation Performance And Retrieval Performance In Chinese IR
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2002
Authors

It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. In this paper we show that, for Chinese, the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserv...
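
The abstract reports word segmentation accuracies without stating, in the text available here, how accuracy is computed. As a rough illustration only, the sketch below scores a predicted segmentation against a reference one by comparing word boundaries as character spans, and reports precision, recall, and F1. The metric choice, the function names `word_spans` and `segmentation_scores`, and the example sentence are all assumptions made for illustration and are not taken from the paper.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold, predicted):
    """Return (precision, recall, f1) of predicted word spans against gold spans.

    Assumption: a predicted word counts as correct only if both of its
    boundaries match a word in the reference segmentation.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must segment the same string")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: the reference keeps a two-character word whole,
# while the predicted segmentation splits it into single characters.
gold = ["信息", "检索", "系统"]
over_segmented = ["信息", "检", "索", "系统"]
precision, recall, f1 = segmentation_scores(gold, over_segmented)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case, splitting one reference word into two single characters lowers both precision and recall, because neither fragment matches a reference span.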