Paper: Approximate Scalable Bounded Space Sketch for Large Data NLP

ACL ID D11-1023
Title Approximate Scalable Bounded Space Sketch for Large Data NLP
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011
Authors

We exploit sketch techniques, especially the Count-Min sketch, a memory, and time effi- cient framework which approximates the fre- quency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply Count- Min sketch to approximate word pair counts and exhibit their effectiveness on three im- portant NLP tasks. Our experiments demon- strate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).