Paper: Word Segmentation of Informal Arabic with Domain Adaptation

ACL ID P14-2034
Title Word Segmentation of Informal Arabic with Domain Adaptation
Venue Annual Meeting of the Association for Computational Linguistics
Session Main Conference
Year 2014

Segmentation of clitics has been shown to improve accuracy on a variety of Arabic NLP tasks. However, state-of-the-art Arabic word segmenters are either limited to formal Modern Standard Arabic, performing poorly on Arabic text featuring dialectal vocabulary and grammar, or rely on linguistic knowledge that is hand-tuned for each dialect. We extend an existing MSA segmenter with a simple domain adaptation technique and new features in order to segment informal and dialectal Arabic text. Experiments show that our system outperforms existing systems on newswire, broadcast news and Egyptian dialect, improving segmentation F1 score on a recently released Egyptian Arabic corpus to 95.1%, compared to 90.8% for another segmenter designed specifically for Egyptian Arabic.
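
The abstract reports results as segmentation F1. As a rough illustration of how such a score can be computed, the sketch below compares predicted and gold segment boundaries word by word. It is a minimal example under stated assumptions, not the paper's evaluation script: the function names `segment_spans` and `segmentation_f1` are hypothetical, the toy words are made-up transliterated strings, and segmentation is assumed to only split a word without rewriting any characters.

```python
def segment_spans(segments):
    """Map a list of segment strings to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for seg in segments:
        end = start + len(seg)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, pred_words):
    """Span-level F1 over aligned words (an assumed definition, for illustration).

    Each word is a list of segment strings. A predicted segment counts as
    correct only if both of its boundaries match a gold segment of that word.
    """
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_words, pred_words):
        gold_spans = segment_spans(gold)
        pred_spans = segment_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: the first word has two gold clitic boundaries, and the
# prediction finds only one of them.
gold = [["w", "b", "ktAb"], ["ktAb"]]
pred = [["w", "bktAb"], ["ktAb"]]
score = segmentation_f1(gold, pred)
```

On this toy input, two of the three predicted segments match gold segments and two of the four gold segments are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.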