Paper: Scaling Distributional Similarity To Large Corpora

ACL ID P06-1046
Title Scaling Distributional Similarity To Large Corpora
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2006

Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naïve nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n^2) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.
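
To make the quadratic cost of the naïve approach concrete, the following is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's implementation: cosine similarity stands in for whichever similarity measure is used, the toy context vectors and the names `cosine` and `naive_nearest_neighbours` are hypothetical, and no approximation method such as SASH is shown.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse context vectors stored as dicts."""
    dot = sum(w * v[c] for c, w in u.items() if c in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def naive_nearest_neighbours(vectors, k=1):
    """Compare every pair of words: n * (n - 1) / 2 similarity computations."""
    comparisons = 0
    scores = {word: [] for word in vectors}
    for a, b in combinations(vectors, 2):
        s = cosine(vectors[a], vectors[b])
        comparisons += 1
        scores[a].append((s, b))
        scores[b].append((s, a))
    # Keep the k highest-scoring candidates for each word.
    neighbours = {
        word: [other for _, other in sorted(cands, key=lambda p: -p[0])[:k]]
        for word, cands in scores.items()
    }
    return neighbours, comparisons


# Hypothetical toy data: word -> {context: count}. Not taken from the paper.
vectors = {
    "car": {"drive": 3, "road": 2, "wheel": 1},
    "automobile": {"drive": 2, "road": 2, "engine": 1},
    "banana": {"eat": 3, "peel": 2},
    "apple": {"eat": 2, "pie": 1},
}

neighbours, comparisons = naive_nearest_neighbours(vectors, k=1)
```

With four words the loop performs six comparisons; the count grows as n(n - 1)/2 with vocabulary size n, which is the cost that approximate nearest-neighbour methods aim to avoid.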