Houdaifa Atou
2026
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
Abdellah EL Mekki | Samar M. Magdy | Houdaifa Atou | Ruwa AbuHweidi | Baraah Qawasmeh | Omer Nacar | Thikra Al-hibiri | Razan Saadie | Hamzah A. Alsayadi | Nadia Ghezaiel Hammouda | Alshima Mohammed Alkhazimi | Aya Hamod | Al-Yas Yaqoob Al-Ghafri | Wesam El-Sayed | Asila Ismail al Sharji | Mohamad Ballout | Anas Belfathi | Karim Ghaddar | Serry Sibaee | Alaa Aoun | Aeej Mohammed Aseri | Lina Abureesh | Ahlam Bashiti | Majdal Yousef | Abdulaziz Hafiz | Yehdih Mohamed | Emira Hamedtou | Brakehe Emehah | Rahaf Alhamouri | Youssef Nafea | Aya El Aatar | Walid Al-Dhabyani | Emhemed S. Hamed | Sara Shatnawi | Fakhraddin Alwajih | Khalid Elkhidir | Ashwag Alasmari | Abdurrahman Gerrio | Omar Said Alshahri | AbdelRahim A. Elmadany | Ismail Berrada | Amir Azad Adli Al-kathiri | Fadi Zaraket | Mustafa Jarrar | Yahya Mohamed EL Hadj | Hassan Alhuzali | Muhammad Abdul-Mageed
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria
2025
Phoenix at Palmx: Exploring Data Augmentation for Arabic Cultural Question Answering
Houdaifa Atou | Issam Ait Yahia | Ismail Berrada
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Fakhraddin Alwajih | Abdellah El Mekki | Samar Mohamed Magdy | AbdelRahim A. Elmadany | Omer Nacar | El Moatez Billah Nagoudi | Reem Abdel-Salam | Hanin Atwany | Youssef Nafea | Abdulfattah Mohammed Yahya | Rahaf Alhamouri | Hamzah A. Alsayadi | Hiba Zayed | Sara Shatnawi | Serry Sibaee | Yasir Ech-chammakhy | Walid Al-Dhabyani | Marwa Mohamed Ali | Imen Jarraya | Ahmed Oumar El-Shangiti | Aisha Alraeesi | Mohammed Anwar AL-Ghrawi | Abdulrahman S. Al-Batati | Elgizouli Mohamed | Noha Taha Elgindi | Muhammed Saeed | Houdaifa Atou | Issam Ait Yahia | Abdelhak Bouayad | Mohammed Machrouh | Amal Makouar | Dania Alkawi | Mukhtar Mohamed | Safaa Taher Abdelfadil | Amine Ziad Ounnoughene | Anfel Rouabhia | Rwaa Assi | Ahmed Sorkatti | Mohamedou Cheikh Tourad | Anis Koubaa | Ismail Berrada | Mustafa Jarrar | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset contains instruction–response pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world—each an author of this paper—PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform strongly, they still exhibit flaws, and smaller open-source models face greater challenges. Furthermore, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility. More information about PALM is available on our project page: https://github.com/UBC-NLP/palm.
Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Fakhraddin Alwajih | Samar M. Magdy | Abdellah El Mekki | Omer Nacar | Youssef Nafea | Safaa Taher Abdelfadil | Abdulfattah Mohammed Yahya | Hamzah Luqman | Nada Almarwani | Samah Aloufi | Baraah Qawasmeh | Houdaifa Atou | Serry Sibaee | Hamzah A. Alsayadi | Walid Al-Dhabyani | Maged S. Al-shaibani | Aya El aatar | Nour Qandos | Rahaf Alhamouri | Samar Ahmad | Mohammed Anwar AL-Ghrawi | Aminetou Yacoub | Ruwa AbuHweidi | Vatimetou Mohamed Lemin | Reem Abdel-Salam | Ahlam Bashiti | Adel Ammar | Aisha Alansari | Ahmed Ashraf | Nora Alturayeif | Alcides Alcoba Inciarte | AbdelRahim A. Elmadany | Mohamedou Cheikh Tourad | Ismail Berrada | Mustafa Jarrar | Shady Shehata | Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: EMNLP 2025
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce PEARL, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 37 annotators from across the Arab world, PEARL comprises over 309K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks (PEARL and PEARL-LITE) along with a specialized subset (PEARL-X) explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models’ cultural grounding compared to conventional scaling methods. PEARL establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abdellah El Mekki | Houdaifa Atou | Omer Nacar | Shady Shehata | Muhammad Abdul-Mageed
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B-parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialects in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat.
2024
Addax at WojoodNER 2024: Attention-Based Dual-Channel Neural Network for Arabic Named Entity Recognition
Issam Yahia | Houdaifa Atou | Ismail Berrada
Proceedings of the Second Arabic Natural Language Processing Conference
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that focuses on extracting entities such as names of people, organizations, locations, and dates from text. Despite significant advancements due to deep learning and transformer architectures like BERT, NER still faces challenges, particularly in low-resource languages like Arabic. This paper presents a BERT-based NER system that utilizes a two-channel parallel hybrid neural network with an attention mechanism specifically designed for the NER Shared Task 2024. In the competition, our approach ranked second by scoring 90.13% in micro-F1 on the test set. The results demonstrate the effectiveness of combining advanced neural network architectures with contextualized word embeddings in improving NER performance for Arabic.
Co-authors
- Ismail Berrada 5
- Muhammad Abdul-Mageed 4
- Abdellah El Mekki 4
- Omer Nacar 4
- Walid Al-Dhabyani 3
- Rahaf Alhamouri 3
- Hamzah A. Alsayadi 3
- Fakhraddin Alwajih 3
- AbdelRahim A. Elmadany 3
- Mustafa Jarrar 3
- Samar Mohamed Magdy 3
- Youssef Nafea 3
- Shady Shehata 3
- Serry Sibaee 3
- Mohammed Anwar AL-Ghrawi 2
- Reem Abdel-Salam 2
- Safaa Taher Abdelfadil 2
- Ruwa AbuHweidi 2
- Ahlam Bashiti 2
- Aya El aatar 2
- Baraah Qawasmeh 2
- Sara Shatnawi 2
- Mohamedou Cheikh Tourad 2
- Issam Ait Yahia 2
- Abdulfattah Mohammed Yahya 2
- Lina Abureesh 1
- Samar Ahmad 1
- Abdulrahman S. Al-Batati 1
- Al-Yas Yaqoob Al-Ghafri 1
- Thikra Al-hibiri 1
- Amir Azad Adli Al-kathiri 1
- Maged S. Al-shaibani 1
- Aisha Alansari 1
- Ashwag Alasmari 1
- Hassan Alhuzali 1
- Marwa Mohamed Ali 1
- Dania Alkawi 1
- Alshima Mohammed Alkhazimi 1
- Nada Almarwani 1
- Samah Aloufi 1
- Aisha Alraeesi 1
- Omar Said Alshahri 1
- Nora Alturayeif 1
- Adel Ammar 1
- Alaa Aoun 1
- Aeej Mohammed Aseri 1
- Ahmed Ashraf 1
- Rwaa Assi 1
- Hanin Atwany 1
- Mohamad Ballout 1
- Anas Belfathi 1
- Abdelhak Bouayad 1
- Yahya Mohamed EL Hadj 1
- Yasir Ech-chammakhy 1
- Wesam El-Sayed 1
- Ahmed Oumar El-Shangiti 1
- Noha Taha Elgindi 1
- Khalid Elkhidir 1
- Brakehe Emehah 1
- Abdurrahman Gerrio 1
- Karim Ghaddar 1
- Abdulaziz Hafiz 1
- Emhemed S. Hamed 1
- Emira Hamedtou 1
- Nadia Ghezaiel Hammouda 1
- Aya Hamod 1
- Alcides Alcoba Inciarte 1
- Imen Jarraya 1
- Anis Koubaa 1
- Vatimetou Mohamed Lemin 1
- Hamzah Luqman 1
- Mohammed Machrouh 1
- Amal Makouar 1
- Elgizouli Mohamed 1
- Mukhtar Mohamed 1
- Yehdih Mohamed 1
- El-Moatez-Billah Nagoudi 1
- Amine Ziad Ounnoughene 1
- Nour Qandos 1
- Anfel Rouabhia 1
- Razan Saadie 1
- Muhammed Saeed 1
- Ahmed Sorkatti 1
- Aminetou Yacoub 1
- Issam Yahia 1
- Majdal Yousef 1
- Fadi A. Zaraket 1
- Hiba Zayed 1
- Asila Ismail al Sharji 1