Paper: Fast Tweet Retrieval with Compact Binary Codes

ACL ID C14-1047
Title Fast Tweet Retrieval with Compact Binary Codes
Venue International Conference on Computational Linguistics
Session Main Conference
Year 2014

Cosine similarity is perhaps the most widely used similarity measure in natural language processing. In the context of Twitter, however, the sheer volume of tweet data makes it expensive to compute cosine similarity across very large numbers of data samples. In this paper, we exploit binary coding to tackle this scalability issue: each data sample is compressed into a compact binary code, which enables highly efficient similarity computations via Hamming distances between the generated codes. To yield semantics-sensitive binary codes for tweet data, we design a binarized matrix factorization model and further improve it in two aspects. First, we force the projection directions employed by the model to be nearly orthogonal to reduce the redundant inform...
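
As a rough illustration of why binary codes make similarity search cheap, the sketch below encodes real-valued vectors as bit strings and compares them with Hamming distance (an XOR followed by a bit count). It is not the binarized matrix factorization model described in the abstract: the projections here are random Gaussian directions, a stand-in assumed purely for illustration, and the names make_projections, encode, hamming, and nearest are hypothetical.

```python
import random


def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions of length dim.

    Assumption: random directions stand in for the learned projections
    of the paper's model, which this sketch does not reproduce.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def encode(vector, projections):
    """Pack the signs of the projected vector into one integer code."""
    code = 0
    for row in projections:
        dot = sum(w * x for w, x in zip(row, vector))
        # One bit per projection: 1 if the dot product is non-negative.
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Count the bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")


def nearest(query_code, codes):
    """Return the index of the stored code closest to query_code."""
    return min(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
```

A query would be encoded with the same projections as the stored items and then compared against every stored code with hamming. Each comparison touches only a few machine words, whereas cosine similarity requires a full dot product over the original feature dimension.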