Omar Said Alshahri
2026
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
Abdellah EL Mekki | Samar M. Magdy | Houdaifa Atou | Ruwa AbuHweidi | Baraah Qawasmeh | Omer Nacar | Thikra Al-hibiri | Razan Saadie | Hamzah A. Alsayadi | Nadia Ghezaiel Hammouda | Alshima Mohammed Alkhazimi | Aya Hamod | Al-Yas Yaqoob Al-Ghafri | Wesam El-Sayed | Asila Ismail al Sharji | Mohamad Ballout | Anas Belfathi | Karim Ghaddar | Serry Sibaee | Alaa Aoun | Aeej Mohammed Aseri | Lina Abureesh | Ahlam Bashiti | Majdal Yousef | Abdulaziz Hafiz | Yehdih Mohamed | Emira Hamedtou | Brakehe Emehah | Rahaf Alhamouri | Youssef Nafea | Aya El Aatar | Walid Al-Dhabyani | Emhemed S. Hamed | Sara Shatnawi | Fakhraddin Alwajih | Khalid Elkhidir | Ashwag Alasmari | Abdurrahman Gerrio | Omar Said Alshahri | AbdelRahim A. Elmadany | Ismail Berrada | Amir Azad Adli Al-kathiri | Fadi Zaraket | Mustafa Jarrar | Yahya Mohamed EL Hadj | Hassan Alhuzali | Muhammad Abdul-Mageed
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves as both a training resource and as a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria
OMAN-SPEECH: A Multi-Layer Annotated Speech Corpus for Omani Arabic Dialects
Rayyan S. Al Khadhuri | Firas Al Mahrouqi | Salim Al Mandhari | Amir Azad Al-Kathiri | Omar Said Alshahri | Ghassab Mansoor Alsaqr | Badri Abdulhakim Mudhsh | Tarek Fatnassi
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Automatic Speech Recognition (ASR) has achieved strong performance in high-resource languages; however, Dialectal Arabic remains significantly under-resourced. This gap is particularly evident in Oman, where Arabic exhibits substantial sociolinguistic variation shaped by settlement patterns between sedentary (Hadari) and nomadic (Badu) communities, which are often overlooked by urban-centric or generalized Gulf Arabic datasets. We introduce OMAN-SPEECH, a sociolinguistically stratified spoken corpus for Omani Arabic comprising approximately 40 hours of spontaneous and semi-spontaneous speech from 32 speakers across 11 Wilayats (provinces). The corpus is balanced to capture regional and lifestyle variation and is annotated at the sentence level with Arabic transcription, English translation, and phonetic transcription using the International Phonetic Alphabet (IPA) through a human-in-the-loop annotation pipeline. OMAN-SPEECH provides a foundational resource for evaluating ASR and related speech technologies on Omani and Gulf Arabic varieties and supports more granular modeling of regional dialectal variation.
2024
Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah A. Alrashoudi | Omar Said Alshahri | Hend Al-Khalifa
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Many under-resourced languages lack computational resources for automatic speech recognition (ASR) due to data scarcity issues. This makes developing accurate ASR models challenging. Shehri or Jibbali, spoken in Oman, lacks extensive annotated speech data. This paper aims to improve an ASR model for this under-resourced language. We collected a Shehri (Jibbali) speech corpus and utilized transfer learning by fine-tuning pre-trained ASR models on this dataset. Specifically, models like Wav2Vec2.0, HuBERT and Whisper were fine-tuned using techniques like parameter-efficient fine-tuning. Evaluation using word error rate (WER) and character error rate (CER) showed that the Whisper model, fine-tuned on the Shehri (Jibbali) dataset, significantly outperformed other models, with the best results from Whisper-medium achieving 3.5% WER. This demonstrates the effectiveness of transfer learning for resource-constrained tasks, showing high zero-shot performance of pre-trained models.
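For readers unfamiliar with the two metrics named in this abstract, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. It is not the authors' evaluation code; the function names `edit_distance`, `wer`, and `cer` are illustrative, and no text normalization (for example, diacritic or punctuation stripping) is applied, which a real evaluation may well include.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from the first i ref items to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Under this definition, a reported WER of 3.5% corresponds to roughly 3.5 word-level edits per 100 reference words. Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.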
Co-authors
- Muhammad Abdul-Mageed 1
- Ruwa AbuHweidi 1
- Lina Abureesh 1
- Rayyan S. Al Khadhuri 1
- Firas Al Mahrouqi 1
- Salim Al Mandhari 1
- Walid Al-Dhabyani 1
- Al-Yas Yaqoob Al-Ghafri 1
- Amir Azad Al-Kathiri 1
- Hend Al-Khalifa 1
- Thikra Al-hibiri 1
- Amir Azad Adli Al-kathiri 1
- Ashwag Alasmari 1
- Rahaf Alhamouri 1
- Hassan Alhuzali 1
- Alshima Mohammed Alkhazimi 1
- Norah A. Alrashoudi 1
- Ghassab Mansoor Alsaqr 1
- Hamzah A. Alsayadi 1
- Fakhraddin Alwajih 1
- Alaa Aoun 1
- Aeej Mohammed Aseri 1
- Houdaifa Atou 1
- Mohamad Ballout 1
- Ahlam Bashiti 1
- Anas Belfathi 1
- Ismail Berrada 1
- Yahya Mohamed EL Hadj 1
- Abdellah El Mekki 1
- Aya El aatar 1
- Wesam El-Sayed 1
- Khalid Elkhidir 1
- AbdelRahim A. Elmadany 1
- Brakehe Emehah 1
- Tarek Fatnassi 1
- Abdurrahman Gerrio 1
- Karim Ghaddar 1
- Abdulaziz Hafiz 1
- Emhemed S. Hamed 1
- Emira Hamedtou 1
- Nadia Ghezaiel Hammouda 1
- Aya Hamod 1
- Mustafa Jarrar 1
- Samar Mohamed Magdy 1
- Yehdih Mohamed 1
- Badri Abdulhakim Mudhsh 1
- Omer Nacar 1
- Youssef Nafea 1
- Baraah Qawasmeh 1
- Razan Saadie 1
- Sara Shatnawi 1
- Serry Sibaee 1
- Majdal Yousef 1
- Fadi A. Zaraket 1
- Asila Ismail al Sharji 1