Paper: Multilingual Document Clustering: An Heuristic Approach Based On Cognate Named Entities

ACL ID P06-1144
Title Multilingual Document Clustering: An Heuristic Approach Based On Cognate Named Entities
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2006
Authors

This paper presents an approach for Multilingual Document Clustering in comparable corpora. The algorithm is of heuristic nature and it uses as unique evidence for clustering the identification of cognate named entities between both sides of the comparable corpora. One of the main advantages of this approach is that it does not depend on bilingual or multilingual resources. However, it depends on the possibility of identifying cognate named entities between the languages used in the corpus. An additional advantage of the approach is that it does not need any information about the right number of clusters; the algorithm calculates it. We have tested this approach with a comparable corpus of news written in English and Spanish. In addition, we have compared the results wi...
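The general idea in the abstract (cognate named entities as the only clustering evidence, with the number of clusters emerging from the data) can be illustrated with a minimal sketch. This is not the paper's heuristic: the normalized edit-distance test in `are_cognates`, the `threshold` and `min_shared` parameters, and the connected-components grouping are all assumptions chosen for illustration, and the document IDs and entity sets are made up.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def are_cognates(x: str, y: str, threshold: float = 0.75) -> bool:
    """Assumed cognate test: normalized string similarity at or above a threshold."""
    x, y = x.lower(), y.lower()
    longest = max(len(x), len(y))
    if longest == 0:
        return False
    return 1 - edit_distance(x, y) / longest >= threshold


def cluster_by_cognates(docs: dict[str, set[str]], min_shared: int = 2) -> list[set[str]]:
    """Link documents that share enough cognate entities, then return the
    connected components. The number of clusters is not given in advance."""
    parent = {doc_id: doc_id for doc_id in docs}

    def find(doc_id: str) -> str:
        # Union-find lookup with path halving.
        while parent[doc_id] != doc_id:
            parent[doc_id] = parent[parent[doc_id]]
            doc_id = parent[doc_id]
        return doc_id

    for a, b in combinations(docs, 2):
        # Count entities of document a that have a cognate in document b.
        shared = sum(
            1 for x in docs[a] if any(are_cognates(x, y) for y in docs[b])
        )
        if shared >= min_shared:
            parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for doc_id in docs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return list(clusters.values())


# Hypothetical named entities extracted from four news items.
docs = {
    "en1": {"Kofi Annan", "Iraq"},
    "es1": {"Kofi Annan", "Irak"},
    "en2": {"Ronaldo", "Madrid"},
    "es2": {"Ronaldo", "Madrid"},
}
clusters = cluster_by_cognates(docs)
```

With these toy inputs the English and Spanish items about the same story end up together, without any bilingual dictionary and without specifying how many clusters to produce. A purely orthographic test like this one only works for language pairs whose named entities are spelled similarly, which matches the dependency the abstract points out.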