SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts

Amany Fashwan, Sameh Alansary


Abstract
This paper presents SHAKKIL, a system that diacritizes Arabic texts automatically. The system handles the diacritization problem at two levels: morphological processing and syntactic processing. The morphological disambiguation algorithm relies on four layers: a uni-morphological form layer, a rule-based morphological disambiguation layer, a statistical disambiguation layer, and an Out Of Vocabulary (OOV) layer. The syntactic disambiguation algorithm detects case-ending diacritics using a rule-based approach that simulates shallow parsing. An annotated corpus is used to extract the Arabic linguistic rules, build the language models, and test the system output. The system is a good trial of the interaction between the rule-based and statistical approaches, where the rules help the statistics detect the right diacritization and vice versa. At this point, the morphological Word Error Rate (WER) is 4.56%, the morphological Diacritic Error Rate (DER) is 1.88%, and the syntactic WER is 9.36%. The best WER is 14.78%, compared with the best published results of 11.68% (Abandah, 2015), 12.90% (Rashwan, et al., 2015), and 13.70% (Metwally, Rashwan, & Atiya, 2016).
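The abstract reports WER and DER without defining them. As a rough illustration only, the sketch below computes both under commonly used definitions: WER as the share of words with at least one wrong diacritic, and DER as the share of letters whose diacritic marks differ from the reference. These definitions, the function names `split_diacritics` and `error_rates`, and the choice to count undiacritized letters in the DER denominator are assumptions, not details taken from the paper.

```python
# Arabic diacritic marks: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (letter, marks) pairs for one word (hypothetical helper)."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the letter that precedes it.
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (WER, DER) for two diacritized texts with identical base letters."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty and have the same word count")

    word_errors = letter_errors = letters = 0
    for ref, hyp in zip(ref_words, hyp_words):
        ref_pairs, hyp_pairs = split_diacritics(ref), split_diacritics(hyp)
        if [l for l, _ in ref_pairs] != [l for l, _ in hyp_pairs]:
            raise ValueError("base letters differ; texts are not aligned")
        # Sort marks so that shadda+vowel order does not count as an error.
        wrong = sum(
            1
            for (_, rm), (_, hm) in zip(ref_pairs, hyp_pairs)
            if sorted(rm) != sorted(hm)
        )
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += 1 if wrong else 0

    return word_errors / len(ref_words), letter_errors / letters


# Usage: the hypothesis gets the case ending of the second word wrong.
reference = "كَتَبَ الوَلَدُ"
hypothesis = "كَتَبَ الوَلَدَ"
wer, der = error_rates(reference, hypothesis)
print(f"WER={wer:.2%} DER={der:.2%}")
```

A single wrong case ending makes a whole word count as an error while touching only one letter, which is one reason a WER can sit well above the corresponding DER.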
Anthology ID:
W17-1311
Volume:
Proceedings of the Third Arabic Natural Language Processing Workshop
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Nizar Habash, Mona Diab, Kareem Darwish, Wassim El-Hajj, Hend Al-Khalifa, Houda Bouamor, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
SIG:
SEMITIC
Publisher:
Association for Computational Linguistics
Pages:
84–93
URL:
https://aclanthology.org/W17-1311
DOI:
10.18653/v1/W17-1311
Cite (ACL):
Amany Fashwan and Sameh Alansary. 2017. SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 84–93, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts (Fashwan & Alansary, WANLP 2017)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/W17-1311.pdf