Paper: Automatic training of lemmatization rules that handle morphological changes in pre- in- and suffixes alike

ACL ID P09-1017
Title Automatic training of lemmatization rules that handle morphological changes in pre- in- and suffixes alike
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2009
Authors
  • Bart Jongejan (University of Copenhagen, Copenhagen, Denmark)
  • Hercules Dalianis (Royal Institute of Technology, Stockholm, Sweden; Euroling AB, Stockholm, Sweden)

We propose a method to automatically train lemmatization rules that handle prefix, infix and suffix changes to generate the lemma from the full form of a word. We explain how the lemmatization rules are created and how the lemmatizer works. We trained this lemmatizer on full form-lemma pairs for Danish, Dutch, English, German, Greek, Icelandic, Norwegian, Polish, Slovene and Swedish. Compared to plain suffix-only lemmatization, we obtained significant improvements of 24 percent for Polish, 2.3 percent for Dutch, 1.5 percent for English, 1.2 percent for German and 1.0 percent for Swedish. Icelandic deteriorated by 1.9 percent. We also made an observation regarding the number of produced lemmatization rules as a function of the number o...
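
To make concrete what a rule covering prefix, infix and suffix changes might look like, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the rule format or training procedure of the paper: the `*` wildcard notation, the function names and the two hand-written German rules are all hypothetical, and nothing here is learned from data.

```python
import re

def compile_rule(pattern, replacement):
    """Turn a wildcard pattern such as 'ge*t' into a compiled regex.

    Assumption: each '*' stands for one or more characters, and the
    replacement reuses the captured pieces in left-to-right order.
    """
    parts = pattern.split("*")
    regex = "(.+)".join(re.escape(p) for p in parts)
    return re.compile("^" + regex + "$"), replacement

def apply_rule(rule, word):
    """Return the rewritten word, or None if the pattern does not match."""
    regex, replacement = rule
    match = regex.match(word)
    if match is None:
        return None
    groups = iter(match.groups())
    # Fill each '*' in the replacement with the next captured piece.
    return "".join(next(groups) if ch == "*" else ch for ch in replacement)

def lemmatize(word, rules):
    """Apply the first matching rule; fall back to the word itself."""
    for rule in rules:
        lemma = apply_rule(rule, word)
        if lemma is not None:
            return lemma
    return word

# Hypothetical, hand-written rules for German past participles.
RULES = [
    compile_rule("*ge*t", "**en"),  # infix 'ge' plus suffix 't'
    compile_rule("ge*t", "*en"),    # prefix 'ge' plus suffix 't'
]

print(lemmatize("aufgemacht", RULES))
print(lemmatize("gemacht", RULES))
```

A suffix-only lemmatizer could rewrite the final `t` to `en` but would leave the `ge` in place, which is the kind of case the prefix and infix handling described above is meant to cover. A word that matches no rule is returned unchanged in this sketch.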