Abubakr Mohamed
2025
IslamicEval 2025: The First Shared Task of Capturing LLMs Hallucination in Islamic Content
Hamdy Mubarak | Rana Malhas | Watheq Mansour | Abubakr Mohamed | Mahmoud Fawzi | Majd Hawasly | Tamer Elsayed | Kareem Mohamed Darwish | Walid Magdy
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture
Fakhraddin Alwajih | Abdellah El Mekki | Hamdy Mubarak | Majd Hawasly | Abubakr Mohamed | Muhammad Abdul-Mageed
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks
Advancing Arabic Diacritization: Improved Datasets, Benchmarking, and State-of-the-Art Models
Abubakr Mohamed | Hamdy Mubarak
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Arabic diacritics, similar to short vowels in English, provide phonetic and grammatical information but are typically omitted in written Arabic, leading to ambiguity. Diacritization (aka diacritic restoration or vowelization) is essential for natural language processing. This paper advances Arabic diacritization through the following contributions: First, we propose a methodology to analyze and refine a large diacritized corpus to improve training quality. Second, we introduce WikiNews-2024, a multi-reference evaluation methodology with an updated version of the standard benchmark “WikiNews-2014”. In addition, we explore various model architectures and propose a BiLSTM-based model that achieves state-of-the-art results with 3.12% and 2.70% WER on WikiNews-2014 and WikiNews-2024, respectively. Moreover, we develop a model that preserves user-specified diacritics while maintaining accuracy. Lastly, we demonstrate that augmenting training data enhances performance in low-resource settings.
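The reported WER figures are word error rates over diacritized tokens; under the multi-reference WikiNews-2024 setup, a word can be scored against several acceptable diacritizations. The following is a minimal illustrative sketch of such a metric, not the paper's official scorer; it assumes whitespace-aligned predicted and reference word sequences, and the "correct if it matches any reference" rule is an assumption based on the abstract's description of multi-reference evaluation.

```python
# Minimal, illustrative word-error-rate (WER) sketch for diacritization.
# Assumptions (not from the paper): whitespace-tokenized, position-aligned
# text; a word counts as correct if it matches ANY reference diacritization.

def diacritization_wer(predicted: list[str], references: list[list[str]]) -> float:
    """predicted: diacritized words; references: one or more aligned reference word lists."""
    assert all(len(ref) == len(predicted) for ref in references), "texts must be aligned"
    errors = 0
    for i, word in enumerate(predicted):
        # A word is an error only if it matches none of the references at position i.
        if not any(word == ref[i] for ref in references):
            errors += 1
    return errors / len(predicted) if predicted else 0.0


if __name__ == "__main__":
    pred = ["كَتَبَ", "الوَلَدُ", "الدَّرْسَ"]
    ref_a = ["كَتَبَ", "الوَلَدُ", "الدَّرْسَ"]
    ref_b = ["كَتَبَ", "الوَلَدُ", "الدَّرْسُ"]
    print(f"WER = {diacritization_wer(pred, [ref_a, ref_b]):.2%}")  # 0.00% here
```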
AraSafe: Benchmarking Safety in Arabic LLMs
Hamdy Mubarak | Abubakr Mohamed | Majd Hawasly
Findings of the Association for Computational Linguistics: EMNLP 2025
We introduce AraSafe, the first large-scale native Arabic safety benchmark for large language models (LLMs), addressing the pressing need for culturally and linguistically representative evaluation resources. The dataset comprises 12K naturally occurring, human-written Arabic prompts containing both harmful and non-harmful content across diverse domains, including linguistics, social studies, and science. Each prompt was independently annotated by two experts into one of nine fine-grained safety categories, including ‘Safe/Not Harmful’, ‘Illegal Activities’, ‘Violence or Harm’, ‘Privacy Violation’, and ‘Hate Speech’. Additionally, to support training classifiers for harmful content and to compensate for the imbalanced representation of harmful content in the natural dataset, we create a synthetic dataset of an additional 12K harmful prompts generated by GPT-4o via carefully designed prompt engineering techniques. We benchmark a number of Arabic-centric and multilingual models in the 7 to 13B parameter range, including Jais, AceGPT, Allam, Fanar, Llama-3, Gemma-2, and Qwen3, as well as fine-tuned BERT-based classifier models, on detecting harmful prompts. GPT-4o was used as an upper-bound reference baseline. Our evaluation reveals critical safety blind spots in Arabic LLMs and underscores the necessity of localized, culturally grounded benchmarks for building responsible AI systems.
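For a concrete sense of the BERT-based classifier baselines, the sketch below shows how a harmful-prompt detector of this kind could be set up with Hugging Face Transformers. The checkpoint name, binary label set, and overall setup are assumptions for illustration only; the paper's actual models, nine-category label scheme, and training details are not reproduced here.

```python
# Illustrative sketch of a BERT-based harmful-prompt classifier of the kind
# benchmarked in AraSafe. The checkpoint and label set below are assumptions,
# not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabertv02"  # assumed Arabic BERT backbone
LABELS = ["not_harmful", "harmful"]             # binary detection for illustration

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

def classify(prompt: str) -> str:
    """Return the predicted safety label for a single Arabic prompt."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Note: the classification head is randomly initialized until fine-tuned on
# annotated prompts, so predictions are meaningless before training.
print(classify("كيف أتعلم البرمجة؟"))
```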