Paper: Weblog Classification For Fast Splog Filtering: A URL Language Model Segmentation Approach

ACL ID N06-2035
Title Weblog Classification For Fast Splog Filtering: A URL Language Model Segmentation Approach
Venue Human Language Technologies
Session Short Paper
Year 2006
Authors

This paper shows that in the context of statistical weblog classification for splog filtering based on n-grams of tokens in the URL, further segmenting the URLs beyond the standard punctuation is helpful. Many splog URLs contain phrases in which the words are glued together in order to avoid splog filtering techniques based on punctuation segmentation and unigrams. A technique which segments long tokens into the words forming the phrase is proposed and evaluated. The resulting tokens are used as features for a weblog classifier whose accuracy is similar to that of humans (78% vs.
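To illustrate the idea of segmenting glued-together URL tokens, here is a minimal Python sketch. It is not the paper's implementation: the vocabulary `UNIGRAM_PROB`, its probabilities, the two unknown-text penalties, and the function names are all hypothetical choices made for this example. It first splits a URL on punctuation (the baseline), then uses dynamic programming to split each token into the word sequence with the highest total unigram log-probability.

```python
import math
import re

# Hypothetical toy unigram probabilities (assumption for illustration only);
# a real system would estimate a language model from a corpus.
UNIGRAM_PROB = {
    "cheap": 0.02, "car": 0.03, "insurance": 0.01, "online": 0.02,
    "blog": 0.03, "free": 0.03, "loans": 0.01,
}
# Assumed penalties for text not covered by the vocabulary.
UNKNOWN_CHAR_LOGPROB = math.log(1e-6)  # cost per unknown character
UNKNOWN_CHUNK_PENALTY = -1.0           # extra cost per unknown chunk


def split_on_punctuation(url):
    """Baseline tokenizer: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def segment_token(token, max_word_len=20):
    """Split one token into the best-scoring sequence of pieces."""
    n = len(token)
    # best[i] = (score of best segmentation of token[:i], start of last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            if piece in UNIGRAM_PROB:
                score = math.log(UNIGRAM_PROB[piece])
            else:
                score = UNKNOWN_CHAR_LOGPROB * len(piece) + UNKNOWN_CHUNK_PENALTY
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Walk the back-pointers to recover the pieces in order.
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(token[start:end])
        end = start
    return pieces[::-1]


def url_features(url):
    """Punctuation split followed by further segmentation of each token."""
    features = []
    for token in split_on_punctuation(url):
        features.extend(segment_token(token))
    return features


if __name__ == "__main__":
    print(url_features("http://cheapcarinsurance-online.blogspot.com"))
```

With the baseline tokenizer alone, `cheapcarinsurance` would remain a single, rarely seen token; after segmentation it contributes `cheap`, `car`, and `insurance`, which a unigram or n-gram classifier can actually match against.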