Paper: Generalising and Normalising Distributional Contexts to Reduce Data Sparsity: Application to Medical Corpora

ACL ID W14-4801
Title Generalising and Normalising Distributional Contexts to Reduce Data Sparsity: Application to Medical Corpora
Venue CompuTerm International Workshop On Computational Terminology
Session
Year 2014
Authors

Vector space models implement the distributional hypothesis. They rely on the repetition of information occurring in the contexts of the words to be associated. However, these models suffer from a high number of dimensions and from data sparsity in the matrix of contextual vectors. This is a major issue with specialised corpora, which are much smaller and have much lower context frequencies. We tackle the problem of data sparsity in specialised texts and propose a method that makes the matrix denser by generalising and normalising distributional contexts. Generalisation gives better results with the Jaccard index, narrow sliding windows and relations of lexical inclusion. Normalisation, on the other hand, has no positive effect on relation extraction, with any combination of ...
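To make the ideas in the abstract concrete, the following is a minimal sketch of sliding-window contexts compared with the Jaccard index, and of how generalising contexts can make two context sets overlap. It is not the authors' implementation: the function names, the toy sentence, and the `generalise` mapping (which stands in for whatever relations the paper actually uses to generalise contexts) are all assumptions made for illustration.

```python
from collections import defaultdict


def window_contexts(tokens, window=1, generalise=None):
    """Map each target word to the set of words within `window` tokens of it.

    `generalise` is a hypothetical dict that rewrites a context word to a
    more general form; context words missing from it are kept unchanged.
    """
    generalise = generalise or {}
    contexts = defaultdict(set)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[target].add(generalise.get(tokens[j], tokens[j]))
    return contexts


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy corpus (illustrative only, not taken from the paper).
tokens = "acute pneumonia treatment chronic bronchitis therapy".split()

# Without generalisation the two targets share no context word.
raw = window_contexts(tokens, window=1)
print(jaccard(raw["pneumonia"], raw["bronchitis"]))

# With the assumed mapping, "therapy" and "treatment" collapse into one
# context, so the context sets overlap and the similarity becomes non-zero.
gen = window_contexts(tokens, window=1, generalise={"therapy": "treatment"})
print(jaccard(gen["pneumonia"], gen["bronchitis"]))
```

In this toy case the similarity moves from zero to a non-zero value only because two distinct context words are merged, which is the sense in which generalisation makes the context matrix denser.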