Paper: Semantic-Based Multilingual Document Clustering via Tensor Modeling

ACL ID D14-1065
Title Semantic-Based Multilingual Document Clustering via Tensor Modeling
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2014
Authors

A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue, we propose a new document clustering approach for multilingual corpora that (i) exploits a large-scale multilingual knowledge base, (ii) takes advantage of the multi-topic nature of the text documents, and (iii) employs a tensor-based model to deal with high dimensionality and sparseness. Results have shown the significance of our approach and its better performance w.r.t. classic document clustering ap...