Paper: Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM

ACL ID N10-1076
Title Automatic Diacritization for Low-Resource Languages Using a Hybrid Word and Consonant CMM
Venue Human Language Technologies
Session Main Conference
Year 2010
Authors

We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approach that uses a hybrid word- and consonant-level conditional Markov model. Our approach rivals the best previously published results in Arabic (15% WER with case endings), without the use of a morphological analyzer. In Syriac, we reduce the WER over a strong baseline by 30% to achieve a WER of 10.5%. We also report results for Hebrew and English.
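
The abstract reports its results as word error rate (WER). As a minimal sketch, assuming WER here means the fraction of words whose predicted diacritized form differs from the reference, the metric and a relative reduction over a baseline could be computed as below. The function names, the transliterated toy tokens, and the numbers in the usage lines are illustrative assumptions, not data or code from the paper, and the abstract does not say whether its 30% figure is a relative or an absolute reduction.

```python
def word_error_rate(reference, predicted):
    """Fraction of words whose predicted diacritized form differs from the reference.

    Assumes both sequences are already aligned word for word, which holds for
    diacritization because the undiacritized words are the same in both.
    """
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned word for word")
    if not reference:
        return 0.0
    errors = sum(1 for ref, hyp in zip(reference, predicted) if ref != hyp)
    return errors / len(reference)


def relative_reduction(baseline_wer, system_wer):
    """Relative WER reduction of a system over a baseline (hypothetical helper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Toy, transliterated tokens used purely for illustration.
reference = ["kataba", "alwaladu", "darsan", "jadidan"]
predicted = ["kataba", "alwalada", "darsan", "jadidan"]

toy_wer = word_error_rate(reference, predicted)   # one wrong word out of four
toy_gain = relative_reduction(0.20, 0.14)         # made-up WER values
```

A word counts as wrong if any of its diacritics is wrong, so this measure is stricter than a per-diacritic error rate computed over the same output.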