Mutaz al-Khatib

Also published as: Mutaz Al Khatib

2026

Large Language Models (LLMs) are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and the ability to abstain when evidence is insufficient. To address this gap, we introduce IslamicFaithQA, a 3,810-item bilingual (Arabic/English) **generative** benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally develop an end-to-end grounded Islamic modeling suite consisting of *(i)* 25K Arabic text-grounded SFT reasoning pairs, *(ii)* 5K bilingual preference samples for reward-guided alignment, and *(iii)* a verse-level Qur’an retrieval corpus of ∼6k atomic *verses* (ayat). Building on these resources, we develop an agentic Qur’an-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic–English robustness even with a small model (i.e., Qwen3 4B). The datasets are publicly available (https://huggingface.co/datasets/QCRI/IslamicFaithQA).
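
As a rough illustration of why atomic single-gold answers make hallucination and abstention directly measurable, the sketch below labels each free-form response as correct, hallucinated, or abstained and reports the three rates. It is not the benchmark's actual scoring procedure, which the abstract does not describe: the abstention markers, the substring match against the gold answer, and the names `label` and `rates` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical abstention phrases (assumption; not taken from the benchmark).
ABSTAIN_MARKERS = ("i don't know", "i do not know", "لا أعلم")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def label(prediction: str, gold: str) -> str:
    """Assign one of three outcomes to a free-form answer.

    Assumption: an answer counts as correct when the single gold answer
    appears in it as a substring; a real evaluation may use a stricter
    matcher or a judge model.
    """
    pred = normalize(prediction)
    if any(marker in pred for marker in ABSTAIN_MARKERS):
        return "abstained"
    if normalize(gold) in pred:
        return "correct"
    return "hallucinated"


def rates(pairs):
    """Return the share of correct, hallucinated, and abstained answers."""
    counts = Counter(label(pred, gold) for pred, gold in pairs)
    total = len(pairs)
    return {k: counts[k] / total for k in ("correct", "hallucinated", "abstained")}


if __name__ == "__main__":
    # Toy (prediction, gold) pairs, invented for illustration only.
    toy = [
        ("The answer is 114.", "114"),
        ("I don't know.", "Mecca"),
        ("Medina", "Mecca"),
        ("mecca", "Mecca"),
    ]
    print(rates(toy))
```

Because every item has exactly one gold answer, any non-abstaining response that misses it can be counted as a hallucination, which is not possible with multiple-choice formats where the model must always pick an option.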

2025


Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation
Abdessalam Bouchekif | Samer Rashwani | Heba Sbahi | Shahd Gaben | Mutaz Al Khatib | Mohammed Ghaly
Proceedings of The Third Arabic Natural Language Processing Conference

This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, ʿilm al-mawārīth. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test each model’s abilities, from understanding the inheritance context to computing the distribution of shares prescribed by Islamic jurisprudence. The results show a wide performance gap among models: o3 and Gemini 2.5 achieved accuracies above 90%, while ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight the limitations of current models in handling structured legal reasoning and suggest directions for improving their performance in Islamic legal reasoning.