Mohamad Ballout
2026
Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs
Abdellah EL Mekki | Samar M. Magdy | Houdaifa Atou | Ruwa AbuHweidi | Baraah Qawasmeh | Omer Nacar | Thikra Al-hibiri | Razan Saadie | Hamzah A. Alsayadi | Nadia Ghezaiel Hammouda | Alshima Mohammed Alkhazimi | Aya Hamod | Al-Yas Yaqoob Al-Ghafri | Wesam El-Sayed | Asila Ismail al Sharji | Mohamad Ballout | Anas Belfathi | Karim Ghaddar | Serry Sibaee | Alaa Aoun | Aeej Mohammed Aseri | Lina Abureesh | Ahlam Bashiti | Majdal Yousef | Abdulaziz Hafiz | Yehdih Mohamed | Emira Hamedtou | Brakehe Emehah | Rahaf Alhamouri | Youssef Nafea | Aya El Aatar | Walid Al-Dhabyani | Emhemed S. Hamed | Sara Shatnawi | Fakhraddin Alwajih | Khalid Elkhidir | Ashwag Alasmari | Abdurrahman Gerrio | Omar Said Alshahri | AbdelRahim A. Elmadany | Ismail Berrada | Amir Azad Adli Al-kathiri | Fadi Zaraket | Mustafa Jarrar | Yahya Mohamed EL Hadj | Hassan Alhuzali | Muhammad Abdul-Mageed
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic (MSA). Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce Alexandria, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of parallel English-Dialectal Arabic multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total turns, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation benchmarks the current capabilities of Arabic-aware LLMs in translating across diverse Arabic dialects and sub-dialects while exposing significant persistent challenges. The Alexandria dataset, the creation prompts, the translation and revision guidelines, and the evaluation code are publicly available in the following repository: https://github.com/UBC-NLP/Alexandria
2025
Can you SPLICE it together? A Human Curated Benchmark for Probing Visual Reasoning in VLMs
Mohamad Ballout | Okajevo Wilfred | Seyedalireza Yaghoubi | Nohayr Muhammad Abdelmoneim | Julius Mayer | Elia Bruni
Findings of the Association for Computational Linguistics: EMNLP 2025
In this work, we introduce SPLICE, a human-curated benchmark derived from the COIN instructional video dataset, designed to probe event-based reasoning across multiple dimensions: temporal, causal, spatial, contextual, and general knowledge. SPLICE includes 3,381 human-filtered videos spanning 12 categories and 180 sub-categories, such as sports, engineering, and housework. These videos are segmented into a total of 11,423 event clips. We evaluate both human participants and state-of-the-art vision-language models (VLMs) on the task of rearranging these clips into coherent event sequences to assess visual reasoning capabilities. Results reveal a significant gap: VLMs struggle to match human performance. While human-annotated textual descriptions improve model accuracy, they do not affect human performance, suggesting that models rely more on language priors than on visual understanding. Even with annotations, VLMs fall short of human-level reasoning, underscoring persistent challenges in visual reasoning. A deeper analysis across sub-categories shows that VLMs perform relatively better on videos where temporal and causal reasoning are dominant, compared to those where contextual and spatial reasoning are dominant. They also perform better on everyday tasks than on specialized ones.
iVISPAR — An Interactive Visual-Spatial Reasoning Benchmark for VLMs
Julius Mayer | Mohamad Ballout | Serwan Jassim | Farbod Nosrat Nezami | Elia Bruni
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. To help overcome these limitations, we introduce iVISPAR, an interactive multimodal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents. iVISPAR is based on a variant of the sliding tile puzzle—a classic problem that demands logical planning, spatial awareness, and multi-step reasoning. The benchmark supports visual 3D, 2D, and text-based input modalities, enabling comprehensive assessments of VLMs’ planning and reasoning skills. We evaluate a broad suite of state-of-the-art open-source and closed-source VLMs, comparing their performance while also providing optimal path solutions and a human baseline to assess the task’s complexity and feasibility for humans. Results indicate that while VLMs perform better on 2D tasks compared to 3D or text-based settings, they struggle with complex spatial configurations and consistently fall short of human performance, illustrating the persistent challenge of visual alignment. This underscores critical gaps in current VLM capabilities, highlighting their limitations in achieving human-level cognition. Project website: https://microcosm.ai/ivispar.
Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering
Muhammad Abu Ahmad | Mohamad Ballout | Raia Abu Ahmad | Elia Bruni
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
2024
FOOL ME IF YOU CAN! An Adversarial Dataset to Investigate the Robustness of LMs in Word Sense Disambiguation
Mohamad Ballout | Anne Dedert | Nohayr Muhammad Abdelmoneim | Ulf Krumnack | Gunther Heidemann | Kai-Uwe Kühnberger
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Word sense disambiguation (WSD) is a key task in natural language processing and lexical semantics. Pre-trained language models with contextualized word embeddings have significantly improved performance in regular WSD tasks. However, these models still struggle with recognizing semantic boundaries and often misclassify homonyms in adversarial contexts. Therefore, we propose FOOL: FOur-fold Obscure Lexical, a new coarse-grained WSD dataset, which includes four different test sets designed to assess the robustness of language models in WSD tasks. Two sets feature typical WSD scenarios, while the other two include sentences with opposing contexts to challenge the models further. We tested two types of models on the proposed dataset: models with encoders, such as the BERT and T5 series of varying sizes, by probing their embeddings, and state-of-the-art large decoder models like GPT-4o and the Llama 3 family, using zero-shot prompting. Across different state-of-the-art language models, we observed a decrease in performance in the latter two sets compared to the first two, with some models being affected more than others. We show interesting findings where small models like T5-large and BERT-large performed better than GPT-4o on Set 3 of the dataset. This indicates that, despite excelling in regular WSD tasks, these models still struggle to correctly disambiguate homonyms in artificial (Set 3) or realistic (Set 4) adversarial contexts.
Co-authors
- Elia Bruni 3
- Nohayr Muhammad Abdelmoneim 2
- Julius Mayer 2
- Muhammad Abdul-Mageed 1
- Muhammad Abu Ahmad 1
- Ruwa AbuHweidi 1
- Lina Abureesh 1
- Raia Abu Ahmad 1
- Walid Al-Dhabyani 1
- Al-Yas Yaqoob Al-Ghafri 1
- Thikra Al-hibiri 1
- Amir Azad Adli Al-kathiri 1
- Ashwag Alasmari 1
- Rahaf Alhamouri 1
- Hassan Alhuzali 1
- Alshima Mohammed Alkhazimi 1
- Hamzah A. Alsayadi 1
- Omar Said Alshahri 1
- Fakhraddin Alwajih 1
- Alaa Aoun 1
- Aeej Mohammed Aseri 1
- Houdaifa Atou 1
- Ahlam Bashiti 1
- Anas Belfathi 1
- Ismail Berrada 1
- Anne Dedert 1
- Yahya Mohamed EL Hadj 1
- Abdellah El Mekki 1
- Aya El aatar 1
- Wesam El-Sayed 1
- Khalid Elkhidir 1
- AbdelRahim A. Elmadany 1
- Brakehe Emehah 1
- Abdurrahman Gerrio 1
- Karim Ghaddar 1
- Abdulaziz Hafiz 1
- Emhemed S. Hamed 1
- Emira Hamedtou 1
- Nadia Ghezaiel Hammouda 1
- Aya Hamod 1
- Gunther Heidemann 1
- Mustafa Jarrar 1
- Serwan Jassim 1
- Ulf Krumnack 1
- Kai-Uwe Kühnberger 1
- Samar Mohamed Magdy 1
- Yehdih Mohamed 1
- Omer Nacar 1
- Youssef Nafea 1
- Farbod Nosrat Nezami 1
- Baraah Qawasmeh 1
- Razan Saadie 1
- Sara Shatnawi 1
- Serry Sibaee 1
- Okajevo Wilfred 1
- Seyedalireza Yaghoubi 1
- Majdal Yousef 1
- Fadi A. Zaraket 1
- Asila Ismail al Sharji 1