Paper: Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages

ACL ID W08-1403
Title Automatic Construction of Domain-specific Dictionaries on Sparse Parallel Corpora in the Nordic languages
Venue Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization
Year 2008

Hallå Norden is a web site with information regarding mobility between the Nordic countries in five different languages: Swedish, Danish, Norwegian, Icelandic and Finnish. We wanted to create a Nordic cross-language dictionary for use in a cross-language search engine for Hallå Norden. The entire set of texts on the web site was treated as one multilingual parallel corpus. From this we extracted parallel corpora for each language pair. The corpora were very sparse, containing on average fewer than 80 000 words per language pair. We used the Uplug word alignment system (Tiedemann 2003a) to create the dictionaries. The results gave on average 213 new dictionary words (frequency > 3) per language pair. The average error rate was 16 percent. Different...
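
The two mechanical steps mentioned in the abstract, forming one corpus per language pair and keeping only dictionary candidates with frequency > 3, can be sketched as below. This is a minimal illustration and not the authors' pipeline: the language codes, the function names `language_pairs` and `filter_by_frequency`, the representation of alignment links as (source word, target word) tuples, and the toy data are all assumptions made for the example. It does not reproduce Uplug's actual output format, and it assumes that "frequency" means how often a word pair was linked.

```python
from collections import Counter
from itertools import combinations

# Assumed codes for Swedish, Danish, Norwegian, Icelandic and Finnish.
LANGUAGES = ["sv", "da", "no", "is", "fi"]


def language_pairs(languages):
    """Return every unordered pair of languages (one parallel corpus each)."""
    return list(combinations(languages, 2))


def filter_by_frequency(aligned_links, threshold=3):
    """Keep word pairs that were linked more than `threshold` times.

    `aligned_links` is a hypothetical list of (source_word, target_word)
    tuples, one per link proposed by a word aligner.
    """
    counts = Counter(aligned_links)
    return {pair: n for pair, n in counts.items() if n > threshold}


# Toy, invented links for one language pair (not data from the paper).
links = (
    [("arbete", "arbejde")] * 5
    + [("skatt", "skat")] * 4
    + [("flytta", "bolig")] * 2  # linked too rarely, so it is dropped
)

pairs = language_pairs(LANGUAGES)
entries = filter_by_frequency(links)
```

With five languages, `pairs` holds ten language pairs, and `entries` keeps only the two word pairs in the toy data that were linked more than three times. A strict threshold of this kind trades coverage for precision, which matters when the corpora are as sparse as the abstract describes.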