Gulf Arabic Diacritization: Guidelines, Initial Dataset, and Results

Nouf Alabbasi, Mohamed Al-Badrashiny, Maryam Aldahmani, Ahmed AlDhanhani, Abdullah Saleh Alhashmi, Fawaghy Ahmed Alhashmi, Khalid Al Hashemi, Rama Emad Alkhobbi, Shamma T Al Maazmi, Mohammed Ali Alyafeai, Mariam M Alzaabi, Mohamed Saqer Alzaabi, Fatma Khalid Badri, Kareem Darwish, Ehab Mansour Diab, Muhammad Morsy Elmallah, Amira Ayman Elnashar, Ashraf Hatim Elneima, MHD Tameem Kabbani, Nour Rabih, Ahmad Saad, Ammar Mamoun Sousou


Abstract
Arabic diacritic recovery is important for a variety of downstream tasks such as text-to-speech. In this paper, we introduce a new Gulf Arabic diacritization dataset composed of 19,850 words based on a subset of the Gumar corpus. We provide a comprehensive set of guidelines for diacritization to enable the diacritization of more data. We also report on diacritization results based on the new corpus using a Hidden Markov Model and character-based sequence-to-sequence models.
Anthology ID:
2022.wanlp-1.33
Volume:
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Houda Bouamor, Hend Al-Khalifa, Kareem Darwish, Owen Rambow, Fethi Bougares, Ahmed Abdelali, Nadi Tomeh, Salam Khalifa, Wajdi Zaghouani
Venue:
WANLP
SIG:
SIGARAB
Publisher:
Association for Computational Linguistics
Pages:
356–360
URL:
https://preview.aclanthology.org/build-pipeline-with-new-library/2022.wanlp-1.33/
DOI:
10.18653/v1/2022.wanlp-1.33
Cite (ACL):
Nouf Alabbasi, Mohamed Al-Badrashiny, Maryam Aldahmani, Ahmed AlDhanhani, Abdullah Saleh Alhashmi, Fawaghy Ahmed Alhashmi, Khalid Al Hashemi, Rama Emad Alkhobbi, Shamma T Al Maazmi, Mohammed Ali Alyafeai, Mariam M Alzaabi, Mohamed Saqer Alzaabi, Fatma Khalid Badri, Kareem Darwish, Ehab Mansour Diab, Muhammad Morsy Elmallah, Amira Ayman Elnashar, Ashraf Hatim Elneima, MHD Tameem Kabbani, Nour Rabih, Ahmad Saad, and Ammar Mamoun Sousou. 2022. Gulf Arabic Diacritization: Guidelines, Initial Dataset, and Results. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), pages 356–360, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Gulf Arabic Diacritization: Guidelines, Initial Dataset, and Results (Alabbasi et al., WANLP 2022)
PDF:
https://preview.aclanthology.org/build-pipeline-with-new-library/2022.wanlp-1.33.pdf