In this paper we present a supervised Word Sense Disambiguation methodology that exploits kernel methods to model sense distinctions. In particular, a combination of kernel functions is adopted to estimate independently both syntagmatic and domain similarity. We defined a kernel function, namely the Domain Kernel, that allowed us to plug "external knowledge" into the supervised learning process. External knowledge is acquired from unlabeled data in a totally unsupervised way, and it is represented by means of Domain Models. We evaluated our methodology on several lexical sample tasks in different languages, significantly outperforming the state-of-the-art for each of them, while reducing the amount of labeled training data required for learning.
1 Introduction

The main limitation of many supervised approaches to Natural Language Processing (NLP) is the lack of available annotated training data. This problem is known as the Knowledge Acquisition Bottleneck.

To reach high accuracy, state-of-the-art systems for Word Sense Disambiguation (WSD) are designed according to a supervised learning framework, in which the disambiguation of each word in the lexicon is performed by constructing a different classifier. A large set of sense tagged examples is then required to train each classifier. This methodology is called the word expert approach (Small, 1980; Yarowsky and Florian, 2002). However, it is clearly unfeasible for all-words WSD tasks, in which all the words of an open text should be disambiguated. On the other hand, the word expert approach works very well for lexical sample WSD tasks (i.e. tasks in which only those words for which enough training data is provided have to be disambiguated). As the original rationale of the lexical sample tasks was to define a clear experimental setting to enhance the comprehension of WSD, they should be considered as preliminary exercises to all-words tasks.
However, this is not currently the case. Algorithms designed for lexical sample WSD are often based on pure supervision and are hence "data hungry". We think that lexical sample WSD should regain its original explorative role and possibly use a minimal amount of training data, exploiting instead external knowledge acquired in an unsupervised way to reach the actual state-of-the-art performance. Indeed, minimal supervision is the basis of state-of-the-art systems for all-words tasks (e.g. (Mihalcea and Faruque, 2004; Decadt et al., 2004)), which are trained on small sense tagged corpora (e.g. SemCor), in which few examples for a subset of the ambiguous words in the lexicon can be found. Thus improving the performance of WSD systems with few learning examples is a fundamental step towards designing a WSD system that works well on real texts. In addition, it is a common opinion that the performance of state-of-the-art WSD systems is not yet satisfactory from an applicative point of view.
To achieve these goals we identified two promising research directions:

1. Modeling independently domain and syntagmatic aspects of sense distinction, to improve the feature representation of sense tagged examples (Gliozzo et al., 2004).

2. Acquiring external knowledge from unlabeled corpora.
The first direction is motivated by the linguistic assumption that syntagmatic and domain (associative) relations are both crucial to represent sense distinctions, while they originate from very different phenomena. Syntagmatic relations hold among words that are typically located close to each other in the same sentence in a given temporal order, while domain relations hold among words that are typically used in the same semantic domain (i.e. in texts having similar topics (Gliozzo et al., 2004)). Their different nature suggests adopting different learning strategies to detect them.

Regarding the second direction, external knowledge would be required to help WSD algorithms to better generalize over the data available for training. On the other hand, most of the state-of-the-art supervised approaches to WSD are still based on "internal" information only (i.e. the only information available to the training algorithm is the set of manually annotated examples).
For example, in the Senseval-3 evaluation exercise (Mihalcea and Edmonds, 2004) many lexical sample tasks were provided, beyond the usual labeled training data, with a large set of unlabeled data. However, to our knowledge, none of the participants exploited this unlabeled material. Exploring this direction is the main focus of this paper. In particular, we acquire a Domain Model (DM) for the lexicon (i.e. a lexical resource representing domain associations among terms), and we exploit this information inside our supervised WSD algorithm. DMs can be automatically induced from unlabeled corpora, allowing the portability of the methodology among languages.

We identified kernel methods as a viable framework in which to implement the assumptions above (Strapparava et al., 2004). Exploiting the properties of kernels, we independently defined a set of domain and syntagmatic kernels and combined them in order to define a complete kernel for WSD.
The domain kernels estimate the (domain) similarity (Magnini et al., 2002) among contexts, while the syntagmatic kernels evaluate the similarity among collocations.

We will demonstrate that using DMs induced from unlabeled corpora is a feasible strategy to increase the generalization capability of the WSD algorithm. Our system far outperforms the state-of-the-art systems in all the tasks on which it has been tested. Moreover, a comparative analysis of the learning curves shows that the use of DMs allows us to remarkably reduce the amount of sense-tagged examples, opening new scenarios for developing systems for all-words tasks with minimal supervision.

The paper is structured as follows.
Section 2 introduces the notion of Domain Model. In particular, an automatic acquisition technique based on Latent Semantic Analysis (LSA) is described. In Section 3 we present a WSD system based on a combination of kernels. In particular, we define a Domain Kernel (see Section 3.1) and a Syntagmatic Kernel (see Section 3.2), to model separately the domain and syntagmatic aspects of sense distinction. In Section 4 our WSD system is evaluated on the Senseval-3 English, Italian, Spanish and Catalan lexical sample tasks.
2 Domain Models
The simplest methodology to estimate the similarity among the topics of two texts is to represent them by means of vectors in the Vector Space Model (VSM), and to exploit the cosine similarity. More formally, let C = {t_1, t_2, ..., t_n} be a corpus, let V = {w_1, w_2, ..., w_k} be its vocabulary, and let T be the k × n term-by-document matrix representing C, such that t_{i,j} is the frequency of word w_i in the text t_j. The VSM is a k-dimensional space R^k, in which the text t_j ∈ C is represented by means of the vector t_j whose i-th component is t_{i,j}. The similarity between two texts in the VSM is estimated by computing the cosine between them.
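As a minimal sketch (not from the paper), the term-by-document matrix and the cosine similarity described above can be built as follows; the toy corpus here is the pair of example sentences discussed next, restricted to content words for brevity:

```python
import numpy as np

# Toy corpus: each column of T is a text, each row a vocabulary word.
vocab = ["affected", "AIDS", "HIV", "virus", "laptop", "infected"]
texts = [["affected", "AIDS"],   # "he is affected by AIDS"
         ["HIV", "virus"]]       # "HIV is a virus"

# k x n term-by-document matrix T, with t_ij = frequency of w_i in t_j.
T = np.array([[t.count(w) for t in texts] for w in vocab], dtype=float)

def cosine(a, b):
    """Cosine similarity between two vectors; 0 for zero vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

# Similarity between the two texts (columns of T).
sim = cosine(T[:, 0], T[:, 1])   # 0.0: the texts share no words
```

The zero similarity obtained here is exactly the lexical variability problem the Domain Model is introduced to address.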
However, this approach does not deal well with lexical variability and ambiguity. For example, the two sentences "he is affected by AIDS" and "HIV is a virus" do not have any words in common. In the VSM their similarity is zero because their vectors are orthogonal, even though the concepts they express are very closely related. On the other hand, the similarity between the two sentences "the laptop has been infected by a virus" and "HIV is a virus" would turn out very high, due to the ambiguity of the word virus. To overcome this problem we introduce the notion of Domain Model (DM), and we show how to use it to define a domain VSM in which texts and terms are represented in a uniform way.
A DM is composed of soft clusters of terms. Each cluster represents a semantic domain, i.e. a set of terms that often co-occur in texts having similar topics. A DM is represented by a k × k' rectangular matrix D, containing the degree of association among terms and domains, as illustrated in Table 1.
Table 1: Example of Domain Matrix

          MEDICINE   COMPUTER SCIENCE
HIV       1          0
AIDS      1          0
virus     0.5        0.5
laptop    0          1

DMs can be used to describe lexical ambiguity and variability.
Lexical ambiguity is represented by associating one term with more than one domain, while variability is represented by associating different terms with the same domain. For example, the term virus is associated with both the domain COMPUTER SCIENCE and the domain MEDICINE (ambiguity), while the domain MEDICINE is associated with both the terms AIDS and HIV (variability).
More formally, let D = {D_1, D_2, ..., D_k'} be a set of domains, such that k' ≪ k. A DM is fully defined by a k × k' domain matrix D representing in each cell d_{i,z} the domain relevance of term w_i with respect to the domain D_z. The domain matrix D is used to define a function D: R^k → R^k', which maps the vectors t_j expressed in the classical VSM into the vectors t'_j in the domain VSM. D is defined by¹

D(t_j) = t_j (I^IDF D) = t'_j    (1)

where I^IDF is a k × k diagonal matrix such that I^IDF_{i,i} = IDF(w_i), t_j is represented as a row vector, and IDF(w_i) is the Inverse Document Frequency of w_i.

¹In (Wong et al., 1985) formula 1 is used to define a Generalized Vector Space Model, of which the Domain VSM is a particular instance.
Vectors in the domain VSM are called Domain Vectors (DVs). DVs for texts are estimated by exploiting formula 1, while the DV w'_i corresponding to the word w_i ∈ V is the i-th row of the domain matrix D. To be a valid domain matrix such vectors should be normalized (i.e. ⟨w'_i, w'_i⟩ = 1).
In the Domain VSM the similarity among DVs is estimated by taking into account second order relations among terms. For example, the similarity between the two sentences "He is affected by AIDS" and "HIV is a virus" is very high, because the terms AIDS, HIV and virus are all highly associated with the domain MEDICINE.
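The effect of the domain mapping on this example can be sketched with the matrix of Table 1; this is an illustrative simplification (the IDF rescaling of Equation 1 is omitted, and only content words are kept):

```python
import numpy as np

# Domain matrix from Table 1 (rows: terms; columns: MEDICINE, COMPUTER SCIENCE).
vocab = ["HIV", "AIDS", "virus", "laptop"]
D = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

# Term-frequency row vectors for the two example sentences.
t1 = np.array([0.0, 1.0, 0.0, 0.0])   # "He is affected by AIDS"
t2 = np.array([1.0, 0.0, 1.0, 0.0])   # "HIV is a virus"

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

sim_vsm = cosine(t1, t2)           # 0.0 in the classical VSM
sim_dom = cosine(t1 @ D, t2 @ D)   # high in the domain VSM (about 0.95)
```

The mapped vectors t1 @ D and t2 @ D both load heavily on the MEDICINE dimension, which is where the second order similarity comes from.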
A DM can be estimated from hand-made lexical resources such as WORDNET DOMAINS (Magnini and Cavaglià, 2000), or by performing a term clustering process on a large corpus. We think that the second methodology is more attractive, because it allows us to automatically acquire DMs for different languages.

In this work we propose the use of Latent Semantic Analysis (LSA) to induce DMs from corpora. LSA is an unsupervised technique for estimating the similarity among texts and terms in a corpus. LSA is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix T describing the corpus. The SVD algorithm can be exploited to acquire a domain matrix D from a large corpus C in a totally unsupervised way.
SVD decomposes the term-by-document matrix T into three matrices, T ≈ V Σ_k' U^T, where Σ_k' is the diagonal k × k matrix containing the highest k' ≪ k singular values of T, with all the remaining elements set to 0. The parameter k' is the dimensionality of the Domain VSM and can be fixed in advance². Under this setting we define the domain matrix D_LSA as

D_LSA = I^N V √Σ_k'    (2)

where I^N is a diagonal matrix such that I^N_{i,i} = 1/√⟨w'_i, w'_i⟩, and w'_i is the i-th row of the matrix V √Σ_k'.³

²It is not clear how to choose the right dimensionality. In our experiments we used 50 dimensions.
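A minimal numpy sketch of this acquisition step, under the assumption that a plain truncated SVD is sufficient (numpy's convention T = U diag(s) V^T puts the term vectors in U, which plays the role of the matrix called V in the text):

```python
import numpy as np

def lsa_domain_matrix(T, k_prime):
    """Acquire a domain matrix D_LSA from a k x n term-by-document
    matrix T, following Equation 2: keep the first k' left singular
    vectors, rescale by the square roots of the singular values, and
    normalize each term row to unit length (the I^N factor)."""
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    W = U[:, :k_prime] * np.sqrt(s[:k_prime])     # V * sqrt(Sigma_k')
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0                     # guard against zero rows
    return W / norms

# Example: a random 100-term x 30-document matrix, 10 latent domains.
rng = np.random.default_rng(0)
T = rng.integers(0, 5, size=(100, 30)).astype(float)
D_lsa = lsa_domain_matrix(T, 10)                  # shape (100, 10), unit rows
```

Each row of the returned matrix is the unit-length Domain Vector of one term, as required for a valid domain matrix.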
³When D_LSA is substituted in Equation 1 the Domain VSM is equivalent to a Latent Semantic Space (Deerwester et al., 1990). The only difference in our formulation is that the vectors representing the terms in the Domain VSM are normalized by the matrix I^N, and then rescaled, according to their IDF value, by the matrix I^IDF. Note the analogy with the tf·idf term weighting schema (Salton and McGill, 1983), widely adopted in Information Retrieval.
3 Kernel Methods for WSD

In the introduction we discussed two promising directions for improving the performance of a supervised disambiguation system. In this section we show how these requirements can be efficiently implemented in a natural and elegant way by using kernel methods.

The basic idea behind kernel methods is to embed the data into a suitable feature space F via a mapping function φ: X → F, and then use a linear algorithm for discovering nonlinear patterns. Instead of using the explicit mapping φ, we can use a kernel function K: X × X → R, which corresponds to the inner product in a feature space that is, in general, different from the input space.

Kernel methods allow us to build a modular system, as the kernel function acts as an interface between the data and the learning algorithm. Thus the kernel function becomes the only domain specific module of the system, while the learning algorithm is a general purpose component. Potentially any kernel function can work with any kernel-based algorithm.
In our system we use Support Vector Machines (Cristianini and Shawe-Taylor, 2000). Exploiting the properties of kernel functions, it is possible to define the kernel combination schema as

K_C(x_i, x_j) = Σ_{l=1}^{n} K_l(x_i, x_j) / √(K_l(x_i, x_i) K_l(x_j, x_j))    (3)

Our WSD system is then defined as a combination of n basic kernels.
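The normalized sum of Equation 3 can be sketched as follows; the two toy kernels here are illustrative assumptions, not the kernels used in the system:

```python
import math

def combine_kernels(kernels, xi, xj):
    """Kernel combination schema of Equation 3: sum of the individual
    kernels, each normalized by the geometric mean of its self-similarities,
    so that every component contributes on a comparable scale."""
    total = 0.0
    for K in kernels:
        denom = math.sqrt(K(xi, xi) * K(xj, xj))
        if denom > 0.0:
            total += K(xi, xj) / denom
    return total

# Two toy kernels over 2-dimensional tuples: a linear kernel and its square.
k_lin = lambda a, b: a[0] * b[0] + a[1] * b[1]
k_poly = lambda a, b: k_lin(a, b) ** 2

x = (3.0, 4.0)
# Each normalized kernel scores 1.0 on identical inputs, so the sum is 2.0.
assert combine_kernels([k_lin, k_poly], x, x) == 2.0
```

The per-kernel normalization is what makes the combination well behaved: each summand is the cosine of the two examples in that kernel's feature space, bounded by 1.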
Each kernel adds some additional dimensions to the feature space. In particular, we have defined two families of kernels: Domain and Syntagmatic kernels. The former is composed of both the Domain Kernel (K_D) and the Bag-of-Words kernel (K_BoW), which capture domain aspects (see Section 3.1).
The latter captures the syntagmatic aspects of sense distinction and is composed of two kernels: the Collocation Kernel (K_Coll) and the Part of Speech Kernel (K_PoS) (see Section 3.2). The WSD kernels (K_wsd and K'_wsd) are then defined by combining them (see Section 3.3).
3.1 Domain Kernels
In (Magnini et al., 2002) it has been claimed that knowing the domain of the text in which a word occurs is crucial information for WSD. For example, the (domain) polysemy between the COMPUTER SCIENCE and the MEDICINE senses of the word virus can be solved by simply considering the domain of the context in which it occurs. This assumption can be modeled by defining a kernel that estimates the domain similarity among the contexts of the words to be disambiguated, namely the Domain Kernel. The Domain Kernel estimates the similarity among the topics (domains) of two texts, so as to capture domain aspects of sense distinction.
It is a variation of the Latent Semantic Kernel (Shawe-Taylor and Cristianini, 2004), in which a DM (see Section 2) is exploited to define an explicit mapping D: R^k → R^k' from the classical VSM into the Domain VSM. The Domain Kernel is defined by

K_D(t_i, t_j) = ⟨D(t_i), D(t_j)⟩ / √(⟨D(t_i), D(t_i)⟩ ⟨D(t_j), D(t_j)⟩)    (4)

where D is the Domain Mapping defined in Equation 1.
Thus the Domain Kernel requires a Domain Matrix D. For our experiments we acquire the matrix D_LSA, described in Equation 2, from a generic collection of unlabeled documents, as explained in Section 2.

A more traditional approach to detect topic (domain) similarity is to extract Bag-of-Words (BoW) features from a large window of text around the word to be disambiguated. The BoW kernel, denoted by K_BoW, is a particular case of the Domain Kernel, in which D = I, and I is the identity matrix. The BoW kernel does not require a DM, so it can be applied to the "strictly" supervised settings, in which an external knowledge source is not provided.
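The special case D = I can be checked directly with a small sketch of Equation 4 (IDF weighting again omitted for brevity; the vectors are arbitrary toy data):

```python
import numpy as np

def domain_kernel(t1, t2, D):
    """Equation 4 with an explicit domain matrix D: cosine similarity
    of the mapped vectors."""
    a, b = t1 @ D, t2 @ D
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

t1 = np.array([1.0, 0.0, 2.0])
t2 = np.array([0.0, 1.0, 1.0])

# With D = I the mapping is the identity, so K_D reduces to the BoW kernel,
# i.e. the plain cosine similarity in the classical VSM.
k_bow = domain_kernel(t1, t2, np.eye(3))
k_vsm = float(t1 @ t2 / (np.linalg.norm(t1) * np.linalg.norm(t2)))
assert abs(k_bow - k_vsm) < 1e-12
```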
3.2 Syntagmatic Kernels
Kernel functions are not restricted to operate on vectorial objects x ∈ R^k. In principle kernels can be defined for any kind of object representation, for example sequences and trees.
As stated in Section 1, syntagmatic relations hold among words collocated in a particular temporal order, so they can be modeled by analyzing sequences of words. We identified the string kernel (or word sequence kernel) (Shawe-Taylor and Cristianini, 2004) as a valid instrument to model our assumptions. The string kernel counts how many times a (non-contiguous) subsequence of symbols u of length n occurs in the input string s, and penalizes non-contiguous occurrences according to the number of gaps they contain (gap-weighted subsequence kernel).

Formally, let V be the vocabulary; the feature space associated with the gap-weighted subsequence kernel of length n is indexed by a set I of subsequences over V of length n. The (explicit) mapping function is defined by

φ_u^n(s) = Σ_{i: u = s(i)} λ^{l(i)},   u ∈ V^n    (5)

where u = s(i) is a subsequence of s in the positions given by the tuple i, l(i) is the length spanned by u, and λ ∈ ]0, 1] is the decay factor used to penalize non-contiguous subsequences. The associated gap-weighted subsequence kernel is defined by

k_n(s_i, s_j) = ⟨φ^n(s_i), φ^n(s_j)⟩ = Σ_{u ∈ V^n} φ_u^n(s_i) φ_u^n(s_j)    (6)

We modified the generic definition of the string kernel in order to make it able to recognize collocations in a local window around the word to be disambiguated.
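A naive sketch of Equations 5 and 6, enumerating subsequences explicitly (this brute-force version is an illustration only, but it is adequate for the short 7-token windows used below; the example sequences are invented):

```python
import itertools
from collections import defaultdict

def gap_weighted_features(s, n, lam):
    """Explicit mapping of Equation 5: weight each length-n (possibly
    non-contiguous) subsequence u of s by lam ** l(i), where l(i) is
    the span covered by its positions i."""
    phi = defaultdict(float)
    for idx in itertools.combinations(range(len(s)), n):
        u = tuple(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel of Equation 6."""
    phi_s = gap_weighted_features(s, n, lam)
    phi_t = gap_weighted_features(t, n, lam)
    return sum(v * phi_t[u] for u, v in phi_s.items() if u in phi_t)

# The shared bigram ("infected", "virus") spans 2 tokens in the first
# sequence and 3 in the second (one gap), contributing 0.5**2 * 0.5**3.
s = ["laptop", "infected", "virus"]
t = ["infected", "by", "virus"]
k2 = subsequence_kernel(s, t, 2)   # 0.03125
```

The Collocation and PoS kernels defined next apply this kernel to windows of lemmata and PoS tags, and Equations 7 and 8 sum it over subsequence lengths up to p.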
In particular, we defined two Syntagmatic Kernels: the n-gram Collocation Kernel and the n-gram PoS Kernel. The n-gram Collocation Kernel K^n_Coll is defined as a gap-weighted subsequence kernel applied to sequences of lemmata around the word l_0 to be disambiguated (i.e. l_-3, l_-2, l_-1, l_0, l_+1, l_+2, l_+3). This formulation allows us to estimate the number of common (sparse) subsequences of lemmata (i.e. collocations) between two examples, in order to capture syntagmatic similarity. Analogously, we defined the PoS Kernel K^n_PoS by setting s to the sequence of PoSs p_-3, p_-2, p_-1, p_0, p_+1, p_+2, p_+3, where p_0 is the PoS of the word to be disambiguated.
The definition of the gap-weighted subsequence kernel, provided by Equation 6, depends on the parameter n, which represents the length of the subsequences analyzed when estimating the similarity among sequences. For example, K^2_Coll allows us to represent the bigrams around the word to be disambiguated in a more flexible way (i.e. bigrams can be sparse). In WSD, typical features are bigrams and trigrams of lemmata and PoSs around the word to be disambiguated, so we defined the Collocation Kernel and the PoS Kernel respectively by Equations 7 and 8⁴.
K_Coll(s_i, s_j) = Σ_{l=1}^{p} K^l_Coll(s_i, s_j)    (7)

K_PoS(s_i, s_j) = Σ_{l=1}^{p} K^l_PoS(s_i, s_j)    (8)

3.3 WSD Kernels
In order to show the impact of using Domain Models in the supervised learning process, we defined two WSD kernels by applying the kernel combination schema described by Equation 3. Thus the following WSD kernels are fully specified by the list of kernels that compose them:

K_wsd: composed of K_Coll, K_PoS and K_BoW

K'_wsd: composed of K_Coll, K_PoS, K_BoW and K_D

The only difference between the two systems is that K'_wsd uses the Domain Kernel K_D. K'_wsd exploits external knowledge, in contrast to K_wsd, whose only available information is the labeled training data.
4 Evaluation and Discussion

In this section we present the performance of our kernel-based algorithms for WSD. The objectives of these experiments are:

- to study the combination of different kernels,
- to understand the benefits of plugging in external information using Domain Models,
- to verify the portability of our methodology among different languages.
⁴The parameters p and λ are optimized by cross-validation. The best results are obtained setting p = 2, λ = 0.5 for K_Coll and λ → 0 for K_PoS.

4.1 WSD Tasks
We conducted the experiments on four lexical sample tasks (English, Catalan, Italian and Spanish) of the Senseval-3 competition (Mihalcea and Edmonds, 2004). Table 2 describes the tasks by reporting the number of words to be disambiguated, the mean polysemy, and the sizes of the training, test and unlabeled corpora. Note that the organizers of the English task did not provide any unlabeled material. So for English we used a domain model built from a portion of the BNC corpus, while for Spanish, Italian and Catalan we acquired DMs from the unlabeled corpora made available by the organizers.
Table 2: Dataset descriptions

          #w   pol   # train  # test  # unlab
Catalan   27   3.11  4469     2253    23935
English   57   6.47  7860     3944    -
Italian   45   6.30  5145     2439    74788
Spanish   46   3.30  8430     4195    61252

4.2 Kernel Combination
In this section we present an experiment to empirically study the kernel combination. The basic kernels (i.e. K_BoW, K_D, K_Coll and K_PoS) have been compared to the combined ones (i.e. K_wsd and K'_wsd) on the English lexical sample task. The results are reported in Table 3 and show that combining kernels significantly improves the performance of the system.
Table 3: The performance (F1) of each basic kernel and their combination on the English lexical sample task

     K_D   K_BoW  K_PoS  K_Coll  K_wsd  K'_wsd
F1   65.5  63.7   62.9   66.7    69.7   73.3
4.3 Portability and Performance
We evaluated the performance of K'_wsd and K_wsd on the lexical sample tasks described above. The results are shown in Table 4 and indicate that using DMs allowed K'_wsd to significantly outperform K_wsd. In addition, K'_wsd turns out to be the best system for all the tested Senseval-3 tasks. Finally, the performance of K'_wsd is higher than the human agreement for the English and Spanish tasks⁵. Note that, in order to guarantee uniform applicability to any language, we do not use any syntactic information provided by a parser.
4.4 Learning Curves
Figures 1, 2, 3 and 4 show the learning curves evaluated on K'_wsd and K_wsd for all the lexical sample tasks. The learning curves indicate that K'_wsd is far superior to K_wsd for all the tasks, even with few examples. The result is extremely promising, for it demonstrates that DMs allow us to drastically reduce the amount of sense tagged data required for learning. It is worth noting, as reported in Table 5, that K'_wsd achieves the same performance as K_wsd using about half of the training data.
Table 5: Percentage of sense tagged examples required by K'_wsd to achieve the same performance as K_wsd with full training

          % of training
English   54
Catalan   46
Italian   51
Spanish   50
5 Conclusion and Future Works

In this paper we presented a supervised algorithm for WSD, based on a combination of kernel functions. In particular, we modeled domain and syntagmatic aspects of sense distinctions by defining respectively domain and syntagmatic kernels. The Domain Kernel exploits Domain Models, acquired from "external" untagged corpora, to estimate the similarity among the contexts of the words to be disambiguated.
The syntagmatic kernels evaluate the similarity between collocations. We evaluated our algorithm on several Senseval-3 lexical sample tasks (i.e. English, Spanish, Italian and Catalan), significantly improving the state-of-the-art for all of them. In addition, the performance of our system outperforms the inter-annotator agreement in both English and Spanish, achieving the upper bound performance.

⁵It is not clear whether the inter-annotator agreement can be considered the upper bound for a WSD system.

Table 4: Comparative evaluation on the lexical sample tasks. Columns report: the Most Frequent baseline, the inter-annotator agreement, the F1 of the best system at Senseval-3, the F1 of K_wsd, the F1 of K'_wsd, and DM+ (the improvement due to DMs, i.e. K'_wsd − K_wsd).

          MF    Agreement  BEST  K_wsd  K'_wsd  DM+
English   55.2  67.3       72.9  69.7   73.3    3.6
Catalan   66.3  93.1       85.2  85.2   89.0    3.8
Italian   18.0  89.0       53.1  53.1   61.3    8.2
Spanish   67.7  85.3       84.2  84.2   88.2    4.0

[Figure 1: Learning curves for the English lexical sample task.]
[Figure 2: Learning curves for the Catalan lexical sample task.]
[Figure 3: Learning curves for the Italian lexical sample task.]
[Figure 4: Learning curves for the Spanish lexical sample task.]

We demonstrated that using external knowledge inside a supervised framework is a viable methodology to reduce the amount of training data required for learning.
In our approach the external knowledge is represented by means of Domain Models automatically acquired from corpora in a totally unsupervised way. Experimental results show that the use of Domain Models allows us to reduce the amount of training data, opening an interesting research direction for all those NLP tasks for which the Knowledge Acquisition Bottleneck is a crucial problem. In particular, we plan to apply the same methodology to Text Categorization, by exploiting the Domain Kernel to estimate the similarity among texts. In the present implementation, our WSD system does not exploit syntactic information produced by a parser. In the future we plan to integrate such information by adding a tree kernel (i.e. a kernel function that evaluates the similarity among parse trees) to the kernel combination schema presented in this paper. Last but not least, we are going to apply our approach to develop supervised systems for all-words tasks, where the quantity of data available to train each word expert classifier is very low.
Acknowledgments

Alfio Gliozzo and Carlo Strapparava were partially supported by the EU project Meaning (IST-2001-34460). Claudio Giuliano was supported by the EU project Dot.Kom (IST-2001-34038). We would like to thank Oier Lopez de Lacalle for useful comments.