Abstract
In this paper we present a system for automatic Arabic text diacritization that uses three levels of analysis granularity in a layered back-off manner. We build and exploit diacritized language models (LMs) for each of the three levels: surface form, morphological segmentation into prefix/stem/suffix, and character level. In each pass, we use Viterbi search to pick the most probable diacritization for each word in the input. We start with the surface-form LM, continue with the morphological level, and finally leverage the character-level LM. Our system outperforms all published systems evaluated on the same training and test data. It achieves a 10.87% word error rate (WER) for full diacritization, including both lexical and syntactic diacritization, and a 3.0% WER for lexical diacritization alone, ignoring syntactic diacritization.
- Anthology ID:
- W17-1321
- Volume:
- Proceedings of the Third Arabic Natural Language Processing Workshop
- Month:
- April
- Year:
- 2017
- Address:
- Valencia, Spain
- Editors:
- Nizar Habash, Mona Diab, Kareem Darwish, Wassim El-Hajj, Hend Al-Khalifa, Houda Bouamor, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
- Venue:
- WANLP
- SIG:
- SEMITIC
- Publisher:
- Association for Computational Linguistics
- Pages:
- 177–184
- URL:
- https://aclanthology.org/W17-1321
- DOI:
- 10.18653/v1/W17-1321
- Cite (ACL):
- Mohamed Al-Badrashiny, Abdelati Hawwari, and Mona Diab. 2017. A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 177–184, Valencia, Spain. Association for Computational Linguistics.
- Cite (Informal):
- A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic (Al-Badrashiny et al., WANLP 2017)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/W17-1321.pdf
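
The layered back-off order described in the abstract (surface form, then morphological segments, then characters) can be illustrated with a minimal sketch. This is not the authors' implementation. The lexicons `WORD_LM`, `MORPH_LM`, and `CHAR_LM`, the transliterated toy entries, the probabilities, and all function names are hypothetical, and the per-level Viterbi search over language models is replaced here by a simple highest-probability lookup so that only the back-off control flow is shown.

```python
from typing import Optional

# Hypothetical toy lexicons standing in for the three diacritized LMs.
# Keys are undiacritized (transliterated) forms; values are candidate
# diacritizations with made-up probabilities.
WORD_LM = {"ktb": [("kataba", 0.6), ("kutub", 0.4)]}
MORPH_LM = {
    "w": [("wa", 1.0)],
    "ktb": [("kataba", 0.6), ("kutub", 0.4)],
}
# Hypothetical character-level fallback: one fixed diacritized form per letter.
CHAR_LM = {"k": "ka", "t": "ta", "b": "ba", "w": "wa",
           "d": "da", "r": "ra", "s": "sa"}


def best(candidates: list[tuple[str, float]]) -> str:
    """Return the candidate with the highest probability."""
    return max(candidates, key=lambda c: c[1])[0]


def word_level(word: str) -> Optional[str]:
    """Level 1: look up the whole surface form."""
    candidates = WORD_LM.get(word)
    return best(candidates) if candidates else None


def morph_level(word: str) -> Optional[str]:
    """Level 2: try prefix/stem splits (a stand-in for real segmentation)."""
    for i in range(1, len(word)):
        prefix, stem = word[:i], word[i:]
        if prefix in MORPH_LM and stem in MORPH_LM:
            return best(MORPH_LM[prefix]) + best(MORPH_LM[stem])
    return None


def char_level(word: str) -> str:
    """Level 3: diacritize letter by letter; unknown letters pass through."""
    return "".join(CHAR_LM.get(ch, ch) for ch in word)


def diacritize(word: str) -> str:
    """Apply the levels in order, backing off when a level has no answer."""
    for level in (word_level, morph_level):
        result = level(word)
        if result is not None:
            return result
    return char_level(word)


if __name__ == "__main__":
    for w in ("ktb", "wktb", "drs"):
        print(w, "->", diacritize(w))
```

In this toy setup, `ktb` is resolved at the surface-form level, `wktb` falls through to the prefix/stem level, and `drs` reaches the character-level fallback. A faithful version would score whole sentences with n-gram LMs and Viterbi search at each level instead of isolated lookups.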