Paper: Learning Comparable Corpora from Latent Semantic Analysis Simplified Document Space

ACL ID W13-2516
Title Learning Comparable Corpora from Latent Semantic Analysis Simplified Document Space
Venue Building and Using Comparable Corpora
Year 2013

Focusing on a systematic Latent Semantic Analysis (LSA) and Machine Learning (ML) approach, this research contributes to the development of a methodology for the automatic compilation of comparable collections of documents. Its originality lies in the delineation of relevant comparability characteristics of similar documents, in line with an established definition of comparable corpora. These innovative characteristics are used to build an LSA vector-based representation of the texts. In this new, reduced-dimensionality document space, an unsupervised machine learning algorithm gathers similar texts into comparable clusters. On a monolingual collection of fewer than 100 documents, the proposed approach assigns comparable documents to different com...
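The pipeline outlined in the abstract (a feature matrix, an LSA projection into a lower-dimensional document space, then unsupervised clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' setup: raw term counts stand in for the paper's comparability characteristics, k-means stands in for the unspecified clustering algorithm, and the toy documents, the function names (`term_document_matrix`, `lsa_embed`, `kmeans`) and the choice of two latent dimensions are hypothetical.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw term-count matrix (rows: documents, columns: terms).

    Assumption: plain word counts replace the paper's comparability features.
    """
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            X[i, index[w]] += 1.0
    return X, vocab


def lsa_embed(X, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    # Keep the k largest singular directions; each row is one document vector.
    return U[:, :k] * s[:k]


def kmeans(Z, n_clusters, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation.

    Assumption: k-means is only one possible unsupervised algorithm.
    """
    centers = [Z[0]]
    while len(centers) < n_clusters:
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical toy collection: two finance texts and two sports texts.
docs = [
    "stock market shares trading prices",
    "market prices shares investors trading rise",
    "football match goal team players",
    "team players football league goal",
]

X, vocab = term_document_matrix(docs)
Z = lsa_embed(X, k=2)            # reduced-dimensionality document space
labels = kmeans(Z, n_clusters=2)
print(labels)
```

On this toy input the two finance texts should fall into one cluster and the two sports texts into the other. A realistic run would presumably swap the count matrix for the comparability features the paper defines and choose the number of latent dimensions and clusters empirically.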