A System for Diacritizing Four Varieties of Arabic
Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Mohamed Eldesouki, Younes Samih, Hassan Sajjad
Abstract
Short vowels, also known as diacritics, are often omitted when writing different varieties of Arabic, including Modern Standard Arabic (MSA), Classical Arabic (CA), and Dialectal Arabic (DA). However, diacritics are required to pronounce words properly, which makes diacritic restoration (also known as diacritization) essential for language learning and text-to-speech applications. In this paper, we present a system for diacritizing MSA, CA, and two varieties of DA, namely Moroccan and Tunisian. The system uses a character-level sequence-to-sequence deep learning model that requires no feature engineering and outperforms all previous state-of-the-art (SOTA) systems on all the Arabic varieties that we test on.
- Anthology ID:
- D19-3037
- Volume:
- Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations
- Month:
- November
- Year:
- 2019
- Address:
- Hong Kong, China
- Editors:
- Sebastian Padó, Ruihong Huang
- Venues:
- EMNLP | IJCNLP
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 217–222
- URL:
- https://aclanthology.org/D19-3037
- DOI:
- 10.18653/v1/D19-3037
- Cite (ACL):
- Hamdy Mubarak, Ahmed Abdelali, Kareem Darwish, Mohamed Eldesouki, Younes Samih, and Hassan Sajjad. 2019. A System for Diacritizing Four Varieties of Arabic. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 217–222, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal):
- A System for Diacritizing Four Varieties of Arabic (Mubarak et al., EMNLP-IJCNLP 2019)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/D19-3037.pdf