Paper: Fragments And Text Categorization

ACL ID P04-3034
Title Fragments And Text Categorization
Venue Annual Meeting of the Association for Computational Linguistics
Session System Demonstration
Year 2004

We introduce two novel methods of text categorization in which documents are split into fragments. We conducted experiments on English, French and Czech. In all cases, the task was binary document classification. We find that both methods increase the accuracy of text categorization. For the Naïve Bayes classifier this increase is significant.

1 Motivation

In the process of automatically classifying documents into several predefined classes, known as text categorization (Sebastiani, 2002), text documents are usually seen as sets or bags of all the words that have appeared in a document, possibly after removing words in a stop-list. In this paper we describe a novel approach to text categorization in which each document is first split into subparts, called fragments. Each frag...