Paper: Improving Bilingual Projections via Sparse Covariance Matrices

ACL ID D11-1086
Title Improving Bilingual Projections via Sparse Covariance Matrices
Venue Conference on Empirical Methods in Natural Language Processing
Session Main Conference
Year 2011

Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices, which in turn lead to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weight the word pairs. Then, we explore different selection strategies to remove the noisy pairs based on the association scores...
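
The two steps named in the abstract (score each cross-lingual word pair, then discard low-scoring pairs) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: document-level pointwise mutual information stands in for the "word association measure", keeping the top k target words per source word stands in for the "selection strategy", and the function names cross_covariance, pmi_scores, and sparsify_top_k are hypothetical.

```python
import numpy as np


def cross_covariance(X, Y):
    """Cross-lingual covariance of aligned term-count matrices.

    X: (n_docs, vocab_x) counts for language 1.
    Y: (n_docs, vocab_y) counts for language 2; row i of Y is aligned
    with row i of X.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (X.shape[0] - 1)


def pmi_scores(X, Y):
    """Document-level PMI per word pair (an assumed association measure)."""
    n_docs = X.shape[0]
    bx = (X > 0).astype(float)          # word present in document?
    by = (Y > 0).astype(float)
    joint = bx.T @ by / n_docs          # P(x and y occur in aligned docs)
    expected = np.outer(bx.mean(axis=0), by.mean(axis=0))  # P(x) * P(y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / expected)
    pmi[~np.isfinite(pmi)] = -np.inf    # never co-occurring pairs rank last
    return pmi


def sparsify_top_k(C, scores, k):
    """Zero every entry of C except the k best-scoring pairs in each row.

    Top-k per row is an assumed selection strategy, chosen for brevity.
    """
    top = np.argsort(-scores, axis=1)[:, :k]
    mask = np.zeros(C.shape, dtype=bool)
    mask[np.arange(C.shape[0])[:, None], top] = True
    mask &= np.isfinite(scores)         # do not keep pairs with no evidence
    return np.where(mask, C, 0.0)
```

On a toy aligned corpus, calling sparsify_top_k(cross_covariance(X, Y), pmi_scores(X, Y), k=1) leaves at most one nonzero entry per source word, while the retained entries keep their original covariance values. A thresholding rule on the scores, or weights from a bilingual dictionary, could be substituted for either assumed step without changing the overall shape of the pipeline.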