Paper: Semi-Supervised SimHash for Efficient Document Similarity Search

ACL ID P11-1010
Title Semi-Supervised SimHash for Efficient Document Similarity Search
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2011

Searching documents that are similar to a query document is an important component in modern information retrieval. Some existing hashing methods can be used for efficient document similarity search. However, unsupervised hashing methods cannot incorporate prior knowledge for better hashing. Although some supervised hashing methods can derive effective hash functions from prior knowledge, they are either computationally expensive or poorly discriminative. This paper proposes a novel (semi-)supervised hashing method named Semi-Supervised SimHash (S3H) for high-dimensional data similarity search. The basic idea of S3H is to learn the optimal feature weights from prior knowledge to relocate the data such that similar data have similar hash codes. We evaluate our method with several ...
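
To make the feature-weighting idea in the abstract concrete, the following is a minimal sketch of a SimHash-style (random projection) hash in which each feature is scaled by a weight before the sign bits are taken. It is an illustration under stated assumptions, not the paper's S3H algorithm: the function names (make_projections, weighted_simhash, hamming) are hypothetical, the projections are plain Gaussian random vectors, and the weights are supplied by hand, since the abstract does not describe how S3H learns them.

```python
import random

def make_projections(num_bits, dim, seed=0):
    """Draw one Gaussian random vector per hash bit (assumed projection scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def weighted_simhash(x, weights, projections):
    """Scale each feature by its weight, then keep the sign of each projection."""
    scaled = [w * v for w, v in zip(weights, x)]
    bits = []
    for r in projections:
        dot = sum(a * b for a, b in zip(r, scaled))
        bits.append(1 if dot >= 0 else 0)
    return bits

def hamming(a, b):
    """Number of bit positions in which two hash codes differ."""
    return sum(p != q for p, q in zip(a, b))

# Hypothetical toy data: two documents that differ only in the last feature.
projections = make_projections(num_bits=16, dim=4, seed=0)
doc_a = [1.0, 2.0, 3.0, 0.0]
doc_b = [1.0, 2.0, 3.0, 5.0]

uniform = [1.0, 1.0, 1.0, 1.0]   # all-ones weights: ordinary SimHash
ignore_last = [1.0, 1.0, 1.0, 0.0]  # hand-set weights that suppress feature 4

code_a = weighted_simhash(doc_a, ignore_last, projections)
code_b = weighted_simhash(doc_b, ignore_last, projections)
print(hamming(code_a, code_b))
```

With all-ones weights the function reduces to ordinary SimHash. With the hand-set weights above, the feature that separates the two documents is zeroed out, so both documents map to the same code. This is the sense in which reweighting features can "relocate" data so that items considered similar share hash codes; a learned weighting would replace the hand-set vector.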