Paper: High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning

ACL ID P13-3020
Title High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning
Venue Annual Meeting of the Association for Computational Linguistics
Session Student Session
Year 2013
Authors

In multi-class document categorization using graph-based semi-supervised learning (GBSSL), it is essential to construct a proper graph expressing the relations among nodes and to use a reasonable categorization algorithm. It is also important to provide high-quality, correct data as training data. In this context, we propose two methods: one constructs a similarity graph by employing both surface information and latent information to express the similarity between nodes, and the other selects high-quality training data for GBSSL by means of the PageRank algorithm. In experiments on the Reuters-21578 corpus, we confirmed that the proposed methods raise the accuracy of multi-class document categorization.
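The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not give the details, so everything concrete here is an assumption: surface information is taken to be term-count vectors, latent information is taken to be per-document topic distributions, the two cosine similarities are blended with a hypothetical weight `alpha`, and training data are chosen by running PageRank on each class's subgraph and keeping the top `k` documents. The function names (`combined_similarity`, `pagerank`, `select_training_data`) are illustrative and not taken from the paper.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    Xn = X / norms
    return Xn @ Xn.T


def combined_similarity(surface, latent, alpha=0.5):
    """Assumed blend: alpha * surface similarity + (1 - alpha) * latent similarity."""
    W = (alpha * cosine_similarity_matrix(surface)
         + (1.0 - alpha) * cosine_similarity_matrix(latent))
    np.fill_diagonal(W, 0.0)  # no self-loops in the graph
    return W


def pagerank(W, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on a weighted adjacency matrix W."""
    n = W.shape[0]
    col_sums = W.sum(axis=0)
    safe = np.where(col_sums > 0, col_sums, 1.0)
    # Column-stochastic transition matrix; isolated nodes jump uniformly.
    P = np.where(col_sums > 0, W / safe, 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = damping * (P @ r) + (1.0 - damping) / n
        converged = np.abs(r_next - r).sum() < tol
        r = r_next
        if converged:
            break
    return r


def select_training_data(W, labels, k):
    """Assumed selection rule: top-k PageRank nodes within each class subgraph."""
    selected = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        scores = pagerank(W[np.ix_(idx, idx)])
        top = idx[np.argsort(-scores)[:k]]
        selected[int(c)] = top.tolist()
    return selected


# Toy data (made up for illustration): 6 documents, 4 terms, 2 topics.
surface = np.array([[3, 1, 0, 0], [2, 2, 0, 1], [1, 3, 1, 0],
                    [0, 0, 3, 1], [0, 1, 2, 2], [1, 0, 1, 3]], dtype=float)
latent = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                   [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])
labels = np.array([0, 0, 0, 1, 1, 1])

W = combined_similarity(surface, latent, alpha=0.5)
print(select_training_data(W, labels, k=2))
```

Running PageRank per class favors documents that are central among the labeled candidates of their own category, which is one plausible reading of "high-quality" training data. A global ranking over the whole graph would be an equally reasonable variant under these assumptions.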