Paper: Scalable Term Selection for Text Categorization

ACL ID D07-1081
Title Scalable Term Selection for Text Categorization
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2007

In text categorization, term selection is an important step for the sake of both categorization accuracy and computational efficiency. Different dimensionalities are expected under different practical resource restrictions of time or space. Traditionally in text categorization, the same scoring or ranking criterion, such as χ² or IG, is adopted for all target dimensionalities; such a criterion considers both the discriminability and the coverage of a term. In this paper, the poor accuracy at a low dimensionality is imputed to the small average vector length of the documents. Scalable term selection is proposed to optimize the term set at a given dimensionality according to an expected average vector length. Discriminability and coverage are separately measured; by adjusting the ratio ...
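
The traditional setup the abstract refers to, a single χ² ranking cut off at a target dimensionality, can be illustrated with a small sketch. This is not the paper's scalable method. It only shows the χ² baseline and how the average vector length shrinks as fewer terms are kept. The toy corpus, the function names, and the definition of average vector length used here (mean number of distinct selected terms per document) are assumptions made for illustration.

```python
def chi_square(docs, labels, term):
    """Chi-square statistic of `term` against a binary label (2x2 table)."""
    n = len(docs)
    pairs = list(zip(docs, labels))
    a = sum(1 for d, y in pairs if term in d and y)          # term present, positive class
    b = sum(1 for d, y in pairs if term in d and not y)      # term present, negative class
    c = sum(1 for d, y in pairs if term not in d and y)      # term absent, positive class
    e = sum(1 for d, y in pairs if term not in d and not y)  # term absent, negative class
    denom = (a + c) * (b + e) * (a + b) * (c + e)
    return n * (a * e - b * c) ** 2 / denom if denom else 0.0


def select_terms(docs, labels, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    vocab = sorted(set().union(*docs))
    scores = {t: chi_square(docs, labels, t) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


def average_vector_length(docs, selected):
    """Mean number of selected terms that occur in each document (assumed definition)."""
    sel = set(selected)
    return sum(len(d & sel) for d in docs) / len(docs)


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"goal", "match", "team"},
    {"goal", "team", "the"},
    {"vote", "party", "the"},
    {"vote", "election", "the"},
]
labels = [True, True, False, False]  # True = "sport", False = "politics"

for k in (1, 3):
    terms = select_terms(docs, labels, k)
    print(k, terms, average_vector_length(docs, terms))
```

On this toy data, keeping a single term leaves half of the documents with an empty vector, which is the kind of low-dimensionality effect the abstract attributes to a small average vector length.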