Heba Sbahi

2025

Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation
Abdessalam Bouchekif | Samer Rashwani | Heba Sbahi | Shahd Gaben | Mutaz Al Khatib | Mohammed Ghaly
Proceedings of The Third Arabic Natural Language Processing Conference

This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, ʿilm al-mawārīth. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test each model’s abilities, from understanding the inheritance context to computing the distribution of shares prescribed by Islamic jurisprudence. The results show a wide performance gap among models. o3 and Gemini 2.5 achieved accuracies above 90%, while ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight the limitations of current models in handling structured legal reasoning and suggest directions for improving their performance in Islamic legal reasoning.

This paper provides a comprehensive overview of the QIAS 2025 shared task, organized as part of the ArabicNLP 2025 conference and co-located with EMNLP 2025. The task was designed to evaluate large language models in the complex domains of religious and legal reasoning. It comprises two subtasks: (1) Islamic Inheritance Reasoning, requiring models to compute inheritance shares according to Islamic jurisprudence, and (2) Islamic Knowledge Assessment, which covers a range of traditional Islamic disciplines. Both subtasks were structured as multiple-choice question answering challenges, with questions stratified by difficulty level. The shared task attracted significant interest, with 44 teams participating in the development phase, from which 18 teams advanced to the final test phase. Of these, 6 teams submitted entries for both subtasks, 8 for Subtask 1 only, and 2 for Subtask 2 only. Ultimately, 16 teams submitted system description papers. Herein, we detail the task’s motivation, dataset construction, and evaluation protocol, and present a summary of the participating systems and their results.

Co-authors

Mutaz Alkhatib 1

Aiman Erbad 1

Emad Soliman Ali Mohamed 1

Wajdi Zaghouani 1

Venues

arabicnlp 2
