Arabic Natural Language Processing Conference (2025)

Volumes

Proceedings of The Third Arabic Natural Language Processing Conference (40 papers)
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks (139 papers)

Proceedings of The Third Arabic Natural Language Processing Conference

Under-represented languages suffer from a lack of data, and as a result, there are few LLMs that support them. Extending an existing LLM to a new language is a practical option for startups, university labs, and organizations with limited budgets. This process involves several steps. In this paper, we describe how we adapted the Falcon3-7B model to Arabic, covering everything from data collection and training to evaluation. Falcon-Arabic was trained exclusively on native data to better capture the cultural and linguistic aspects of the language. Our evaluations show that Falcon-Arabic achieves state-of-the-art results on a range of Arabic benchmarks.

ArabJobs: A Multinational Corpus of Arabic Job Ads
Mo El-Haj

ArabJobs is a publicly available corpus of Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates. Comprising over 8,500 postings and more than 550,000 words, the dataset captures linguistic, regional, and socio-economic variation in the Arab labour market. We present analyses of gender representation and occupational structure, and highlight dialectal variation across ads, which offers opportunities for future research. We also demonstrate applications such as salary estimation and job category normalisation using large language models, alongside benchmark tasks for gender bias detection and profession classification. The findings show the utility of ArabJobs for fairness-aware Arabic NLP and labour market research. The dataset is publicly available on GitHub: https://github.com/drelhaj/ArabJobs.

Semitic Root Encoding: Tokenization Based on the Templatic Morphology of Semitic Languages in NMT
Brendan T. Hatch | Stephen D. Richardson

The morphological structure of Semitic languages, such as Arabic, is based on non-concatenative roots and templates. This complex word structure used by humans is obscured to neural models that employ traditional tokenization algorithms, such as byte-pair encoding (BPE) (Sennrich et al., 2016; Gage, 1994). In this work, we present and evaluate Semitic Root Encoding (SRE), a tokenization method that represents both concatenative and non-concatenative structures in Semitic words with sequences of root, template stem, and BPE tokens. We apply the method to neural machine translation (NMT) and find that SRE tokenization yields an average increase of 1.15 BLEU over the baseline. SRE tokenization is also robust against generating combinations of roots with template stems that do not occur in nature. Finally, we compare the performance of SRE to tokenization based on non-linguistic root and template structures and tokenization based on stems, providing evidence that NMT models are capable of leveraging tokens based on non-concatenative Semitic morphology.
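
As a rough illustration of the root-and-template idea, and not the authors' SRE implementation, the sketch below splits a romanized word into a root token and a template token and then recombines them. The helper names, the `_` slot marker, and the romanization are assumptions made for this example, and the root is taken as given rather than predicted.

```python
def split_root_template(word: str, root: str, slot: str = "_") -> tuple[str, str]:
    """Replace the root consonants in `word`, in order, with a slot marker.

    Returns (root, template). Raises ValueError if the root consonants
    do not all occur, in order, inside the word.
    """
    template = []
    remaining = list(root)
    for ch in word:
        if remaining and ch == remaining[0]:
            template.append(slot)   # this position is filled by the root
            remaining.pop(0)
        else:
            template.append(ch)     # vowels and affixes stay in the template
    if remaining:
        raise ValueError(f"root {root!r} not found in order in {word!r}")
    return root, "".join(template)


def merge_root_template(root: str, template: str, slot: str = "_") -> str:
    """Inverse of split_root_template: pour root consonants into the slots."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == slot else ch for ch in template)


# Hypothetical romanized examples sharing the root k-t-b.
for word in ("kataba", "maktuub", "kitaab"):
    root, template = split_root_template(word, "ktb")
    print(word, "->", root, template)
```

Words built on the same root share one root token here, while the template token carries the non-concatenative pattern; this is the kind of structure that a purely concatenative subword segmentation tends to hide.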

Arabic is one of the most widely spoken languages in the world, yet efforts to develop and evaluate Large Language Models (LLMs) for Arabic remain relatively limited. Most existing Arabic benchmarks focus on linguistic, cultural, or religious content, leaving a significant gap in areas like STEM and coding domains that are increasingly relevant for real-world LLM applications. To help bridge this gap, we present 3LM, a suite of three benchmarks designed specifically for Arabic. The first is a set of STEM-related question-answer pairs, naturally sourced from Arabic textbooks and educational worksheets. The second consists of synthetically generated STEM questions, created using the same sources. The third benchmark focuses on code generation, built through a careful translation of two widely used code benchmarks, incorporating a human-in-the-loop process with several rounds of review to ensure high-quality and faithful translations. We release all three benchmarks publicly to support the growth of Arabic LLM research in these essential but underrepresented areas.

TuniFra: A Tunisian Arabic Speech Corpus with Orthographic Transcriptions and French Translations
Alex Choux | Marko Avila | Josep Crego | Fethi Bougares | Antoine Laurent

We introduce TuniFra, a novel and comprehensive corpus developed to advance research in Automatic Speech Recognition (ASR) and Speech-to-Text Translation (STT) for Tunisian Arabic, a notably low-resourced language variety. The TuniFra corpus comprises 15 hours of native Tunisian Arabic speech, carefully transcribed and manually translated into French. While the development of ASR and STT systems for major languages is supported by extensive datasets, low-resource languages such as Tunisian Arabic face significant challenges due to limited training data, particularly for speech technologies. TuniFra addresses this gap by offering a valuable resource tailored for both ASR and STT tasks in the Tunisian dialect. We describe our methodology for data collection, transcription, and annotation, and present initial baseline results for both Tunisian Arabic speech recognition and Tunisian Arabic–French speech translation.

The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Chen Amiraz | Yaroslav Fyodorov | Elad Haramaty | Zohar Karnin | Liane Lewin-Eytan

Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with substantial performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose two simple retrieval strategies that address this source of failure by enforcing equal retrieval from both languages or by translating the query, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
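
One way to read "enforcing equal retrieval from both languages" is to take the same number of top-scoring passages from each language instead of a single global top-k. The sketch below is a minimal illustration under that assumption; the `Hit` record, the `balanced_top_k` helper, and the scores are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    doc_id: str
    lang: str     # language of the retrieved passage, e.g. "ar" or "en"
    score: float  # retriever similarity score (higher is better)


def balanced_top_k(hits, k, langs=("ar", "en")):
    """Keep the k // len(langs) best hits per language, then order by score."""
    per_lang = k // len(langs)
    selected = []
    for lang in langs:
        ranked = sorted(
            (h for h in hits if h.lang == lang),
            key=lambda h: h.score,
            reverse=True,
        )
        selected.extend(ranked[:per_lang])
    return sorted(selected, key=lambda h: h.score, reverse=True)


# Hypothetical scores in which same-language passages dominate the ranking.
hits = [
    Hit("e1", "en", 0.9), Hit("e2", "en", 0.8), Hit("e3", "en", 0.7),
    Hit("a1", "ar", 0.6), Hit("a2", "ar", 0.5),
]
print([h.doc_id for h in balanced_top_k(hits, k=4)])
```

With a plain global top-4, the third English passage would displace the second Arabic one; the per-language quota keeps both languages represented regardless of how the retriever scores across languages.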

Open-domain Arabic Conversational Question Answering with Question Rewriting
Mariam E. Hassib | Nagwa El-Makky | Marwan Torki

Conversational question-answering (CQA) plays a crucial role in bridging the gap between human language and machine understanding, enabling more natural, interactive exchanges with AI systems. In this work, we present the first results on open-domain Arabic CQA using deep learning. We introduce AraQReCC, a large-scale Arabic CQA dataset containing 9K conversations with 62K question-answer pairs, created by translating a subset of the QReCC dataset. To ensure data quality, we used COMET-based filtering and manual ratings from large language models (LLMs), such as GPT-4 and LLaMA, selecting conversations with COMET scores, along with LLM ratings of 4 or more. AraQReCC facilitates advanced research in Arabic CQA, improving clarity and relevance through question rewriting. We applied AraT5 for question rewriting and used BM25 and Dense Passage Retrieval (DPR) for passage retrieval. AraT5 is also used for question answering, completing the end-to-end system. Our experiments show that the best performance is achieved with DPR, attaining an F1 score of 21.51% on the test set. While this falls short of the human upper bound of 40.22%, it underscores the importance of question rewriting and quality-controlled data in enhancing system performance.

ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation
Mohammed Sabry Mohammed | Mohammed Khalil

Classical Arabic represents a significant era that encompasses the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, which comprises 66,000 high-quality classical Arabic to English translation samples that cover a wide array of topics including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub. To preserve anonymity during review, we additionally provide an anonymized snapshot at https://drive.google.com/drive/folders/1c_ElsblaOJzQ0TW_M1DugjR2o3Xv9RUo.

A-SEA³𝐋-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation
Kesen Wang | Daulet Toibazar | Pedro J Moreno Mengibar

We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized Large Vision Language Models (LVLMs), namely a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metrics as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic LVLMs. Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.

Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models
Mostafa Saeed | Nizar Habash

Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both without analyzers and with their integration. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios.

Saudi-Alignment Benchmark: Assessing LLMs Alignment with Cultural Norms and Domain Knowledge in the Saudi Context
Manal Alhassoun | Imaan Mohammed Alkhanen | Nouf Alshalawi | Ibtehal Baazeem | Waleed Alsanie

For effective use in specific countries, Large Language Models (LLMs) need a strong grasp of local culture and core knowledge to ensure socially appropriate, context-aware, and factually correct responses. Existing Arabic and Saudi benchmarks are limited, focusing mainly on dialects or lifestyle, with little attention to deeper cultural or domain-specific alignment from authoritative sources. To address this gap and the challenge LLMs face with non-Western cultural nuance, this study introduces the Saudi-Alignment Benchmark. It consists of 874 manually curated questions across two core cultural dimensions: Saudi Cultural and Ethical Norms, and Saudi Domain Knowledge. These questions span multiple subcategories and use three formats to assess different goals with verified sources. Our evaluation reveals significant variance in LLM alignment. GPT-4 achieved the highest overall accuracy (83.3%), followed by ALLaM-7B (81.8%) and Llama-3.3-70B (81.6%), whereas Jais-30B exhibited a pronounced shortfall at 21.9%. Furthermore, multilingual LLMs excelled in norms; ALLaM-7B in domain knowledge. Considering the effect of question format, LLMs generally excelled in selected-response formats but showed weaker results on generative tasks, indicating that recognition-based benchmarks alone may overestimate cultural and contextual alignment. These findings highlight the need for tailored benchmarks and reveal LLMs’ limitations in achieving cultural grounding, particularly in underrepresented contexts like Saudi Arabia.

AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Aisha Alansari | Hamzah Luqman

Recent research on hallucination in large language models (LLMs) has focused mainly on English. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs’ hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic’s widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs’ outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval

Evaluating Prompt Relevance in Arabic Automatic Essay Scoring: Insights from Synthetic and Real-World Data
Chatrine Qwaider | Kirill Chirkunov | Bashar Alhafni | Nizar Habash | Ted Briscoe

Prompt relevance is a critical yet underexplored dimension in Arabic Automated Essay Scoring (AES). We present the first systematic study of binary prompt-essay relevance classification, supporting both AES scoring and dataset annotation. To address data scarcity, we built a synthetic dataset of on-topic and off-topic pairs and evaluated multiple models, including threshold-based classifiers, SVMs, causal LLMs, and a fine-tuned masked SBERT model. For real-data evaluation, we combined QAES with ZAEBUC, creating off-topic pairs via mismatched prompts. We also tested prompt expansion strategies using AraVec, CAMeL, and GPT-4o. Our fine-tuned SBERT achieved 98% F1 on synthetic data and strong results on QAES+ZAEBUC, outperforming SVMs and threshold-based baselines and offering a resource-efficient alternative to LLMs. This work establishes the first benchmark for Arabic prompt relevance and provides practical strategies for low-resource AES.
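
The abstract does not describe how its threshold-based classifiers work. A common form, assumed here purely for illustration, compares a prompt embedding with an essay embedding and labels the pair on-topic when their cosine similarity reaches a cut-off. The vectors and the threshold value below are hypothetical placeholders, not values from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_on_topic(prompt_vec, essay_vec, threshold):
    """Binary prompt-essay relevance: similarity at or above the threshold."""
    return cosine_similarity(prompt_vec, essay_vec) >= threshold


# Toy 2-d "embeddings"; a real system would use sentence-level vectors.
prompt = [1.0, 0.0]
print(is_on_topic(prompt, [0.9, 0.1], threshold=0.5))  # similar direction
print(is_on_topic(prompt, [0.0, 1.0], threshold=0.5))  # orthogonal
```

In practice the threshold would be tuned on held-out on-topic and off-topic pairs rather than fixed by hand.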

WojoodOntology: Ontology-Driven LLM Prompting for Unified Information Extraction Tasks
Alaa Aljabari | Nagham Hamad | Mohammed Khalilia | Mustafa Jarrar

Information Extraction tasks such as Named Entity Recognition and Relation Extraction are often developed using diverse tagsets and annotation guidelines. This presents major challenges for model generalization, cross-dataset evaluation, tool interoperability, and broader industry adoption. To address these issues, we propose an information extraction ontology, WojoodOntology, which covers a wide range of named entity types and relations. WojoodOntology serves as a semantic mediation framework that facilitates alignment across heterogeneous tagsets and annotation guidelines. We propose two ontology-based mapping methods: (i) as a set of mapping rules for uni-directional tagset alignment; and (ii) as ontology-based prompting, which incorporates the ontology concepts directly into prompts, enabling large language models (LLMs) to perform more effective and bi-directional mappings. Our experiments show a 15% improvement in out-of-domain mapping accuracy when using ontology-based prompting compared to rule-based methods. Furthermore, WojoodOntology is aligned with Schema.org and Wikidata, enabling interoperability with knowledge graphs and facilitating broader industry adoption. WojoodOntology is open source and available at https://sina.birzeit.edu/wojood.

Tahdib: A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition
Mohamad Elzohbi | Richard Zhao

This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on a poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.

Can LLMs Directly Retrieve Passages for Answering Questions from Qur’an?
Sohaila Eltanbouly | Salam Albatarni | Shaimaa Hassanein | Tamer Elsayed

The Holy Qur’an provides timeless guidance, addressing modern challenges and offering answers to many important questions. The Qur’an QA 2023 shared task introduced the Qur’anic Passage Retrieval (QPR) task, which involves retrieving relevant passages in response to MSA questions. In this work, we evaluate the ability of seven pre-trained large language models (LLMs) to retrieve relevant passages from the Qur’an in response to given questions, considering zero-shot and several few-shot scenarios. Our experiments show that the best model, Claude, significantly outperforms the state-of-the-art QPR model by 28 points on MAP and 38 points on MRR, exhibiting an impressive improvement of about 113% and 82%, respectively.
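
For readers unfamiliar with the two retrieval metrics cited above, the sketch below computes reciprocal rank and average precision for a single question using their standard definitions; MRR and MAP are the means of these values over all questions. The passage identifiers are hypothetical, and this is not the shared task's official scorer.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, passage in enumerate(ranked, start=1):
        if passage in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@rank over the ranks where a relevant passage appears,
    normalized by the total number of relevant passages."""
    if not relevant:
        return 0.0
    found = 0
    total = 0.0
    for rank, passage in enumerate(ranked, start=1):
        if passage in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)


def mean(values):
    """Arithmetic mean, used to aggregate per-question scores into MRR or MAP."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# One hypothetical question: the relevant passages appear at ranks 2 and 3.
ranked = ["p3", "p1", "p7"]
relevant = {"p1", "p7"}
print(reciprocal_rank(ranked, relevant), average_precision(ranked, relevant))
```

Because both metrics lie between 0 and 1 and are usually reported multiplied by 100, a gain quoted in "points" is an absolute difference, whereas the percentages in the abstract are relative to the previous system's score.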

ArabEmoNet: A Lightweight Hybrid 2D CNN-BiLSTM Model with Attention for Robust Arabic Speech Emotion Recognition
Ali Abouzeid | Bilal Elbouardi | Mohamed Maged | Shady Shehata

Speech emotion recognition is vital for human-computer interaction, particularly for low-resource languages like Arabic, which face challenges due to limited data and research. We introduce ArabEmoNet, a lightweight architecture designed to overcome these limitations and deliver state-of-the-art performance. Unlike previous systems relying on discrete MFCC features and 1D convolutions, which miss nuanced spectro-temporal patterns, ArabEmoNet uses Mel spectrograms processed through 2D convolutions, preserving critical emotional cues often lost in traditional methods. While recent models favor large-scale architectures with millions of parameters, ArabEmoNet achieves superior results with just 1 million parameters—90 times smaller than HuBERT base and 74 times smaller than Whisper. This efficiency makes it ideal for resource-constrained environments. ArabEmoNet advances Arabic speech emotion recognition, offering exceptional performance and accessibility for real-world applications.

Capturing Intra-Dialectal Variation in Qatari Arabic: A Corpus of Cultural and Gender Dimensions
Houda Bouamor | Sara Al-Emadi | Zeinab Ibrahim | Hany Fazzaa | Aisha Al-Sultan

We present the first publicly available, multidimensional corpus of Qatari Arabic that captures intra-dialectal variation across Urban and Bedouin speakers. While often grouped under the label of “Gulf Arabic”, Qatari Arabic exhibits rich phonological, lexical, and discourse-level differences shaped by gender, age, and sociocultural identity. Our dataset includes aligned speech and transcriptions from 255 speakers, stratified by gender and age, and collected through structured interviews on culturally salient topics such as education, heritage, and social norms. The corpus reveals systematic variation in pronunciation, vocabulary, and narrative style, offering insights for both sociolinguistic analysis and computational modeling. We also demonstrate its utility through preliminary experiments in the prediction of dialects and genders. This work provides the first large-scale, demographically balanced corpus of Qatari Arabic, laying a foundation for both sociolinguistic research and the development of dialect-aware NLP systems.

Feature Engineering is not Dead: A Step Towards State of the Art for Arabic Automated Essay Scoring
Marwan Sayed | Sohaila Eltanbouly | May Bashendy | Tamer Elsayed

Automated Essay Scoring (AES) has shown significant advancements in educational assessment. However, under-resourced languages like Arabic have received limited attention. To bridge this gap and enable robust Arabic AES, this paper introduces the first publicly available comprehensive set of engineered features tailored for Arabic AES, covering surface-level, readability, lexical, syntactic, and semantic features. Experiments are conducted on a dataset of 620 Arabic essays, each annotated with both holistic and trait-specific scores. Our findings demonstrate that the proposed feature set is effective across different models and competitive with recent NLP advances including LLMs, establishing state-of-the-art performance and providing strong baselines for future Arabic AES research. Moreover, the resulting feature set offers a reusable and foundational resource, contributing towards the development of more effective Arabic AES systems.

This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, ʿilm al-mawārīth. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test each model’s abilities, from understanding the inheritance context to computing the distribution of shares prescribed by Islamic jurisprudence. The results show a wide performance gap among models. o3 and Gemini 2.5 achieved accuracies above 90%, while ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight the limitations of current models in handling structured legal reasoning and suggest directions for improving their performance in Islamic legal reasoning.

The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English
Fethi Bougares | Salima Mdhaffar | Haroun Elleuch | Yannick Estève

In this paper, we introduce TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset. This work is in line with the ongoing effort to mitigate the data scarcity obstacle for a number of Arabic dialects. We collected, segmented, transcribed and translated 108 TEDx talks following our internally developed annotation guidelines. The collected talks represent 25 hours of speech with code-switching and cover speakers with various accents from over 11 different regions of Tunisia. We make the annotation guidelines and corpus publicly available. This will enable the extension of TEDxTN to new talks as they become available. We also report results for strong baseline systems of Speech Recognition and Speech Translation using multiple pre-trained and fine-tuned end-to-end models. This corpus is the first open-source and publicly available speech translation corpus of code-switched Tunisian dialect. We believe that this is a valuable resource that can motivate and facilitate further research on the Tunisian dialect.

Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g. DiDeMo, MSR-VTT) and recent multilingual corpora (e.g. RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce a three-stage framework, AutoArabic, utilizing state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the required manual revision nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark, produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. An analysis of the translation errors is provided and organized into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark, finding a moderate performance gap (Δ ≈ 3pp at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero, flagged-only, and full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero-budget) remains usable. To enable reproduction for other languages, we have made the code available at https://github.com/Tahaalshatiri/AutoArabic.

Zero-Shot and Fine-Tuned Evaluation of Generative LLMs for Arabic Word Sense Disambiguation
Yossra Noureldien | Abdelrazig Mohamed | Farah Attallah

Arabic presents unique challenges for sense-level language understanding due to its rich morphology and semantic ambiguity. This paper benchmarks large generative language models (LLMs) for Arabic Word Sense Disambiguation (WSD) under both zero-shot and fine-tuning conditions. We evaluate one proprietary model (GPT-4o) and three open-source models (LLaMA 3.1-8B, Qwen 2.5-7B, and Gemma 2-9B) on two publicly available datasets. In zero-shot settings, GPT-4o achieved the highest overall performance, with comparable results across both datasets, reaching 79% accuracy and an average macro-F1 score of 66.08%. Fine-tuning, however, notably elevated all open models beyond GPT-4o’s zero-shot results. Qwen achieved the top scores on one dataset, with an accuracy of 90.77% and a macro-F1 score of 83.98%, while LLaMA scored highest on the other, reaching an accuracy of 88.51% and a macro-F1 score of 69.41%. These findings demonstrate that parameter-efficient supervised adaptation can close much of the performance gap and establish strong, reproducible baselines for Arabic WSD using open-source, relatively medium-sized models. Full code is publicly available.
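
The gaps between accuracy and macro-F1 reported above are typical when sense labels are imbalanced, because macro-F1 averages per-class F1 with equal weight for rare and frequent classes. The sketch below uses the standard definitions on a small hypothetical label set; it is not the paper's evaluation code, and averaging over the union of gold and predicted labels is a choice made for this example.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: a frequent sense "a" and a rare sense "b" that is never predicted.
gold = ["a", "a", "a", "a", "b"]
pred = ["a", "a", "a", "a", "a"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Here the classifier is right on four of five items, yet its F1 on the rare sense is zero, which pulls the macro average well below the accuracy.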

We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMA, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model delivers a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to a single language with dual-script usage, addressing an often overlooked aspect in contemporary LLM development.

Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., Persona and System Prompts); (3) Alignment (e.g., Cultural, Safety, Ethics, and Fairness); and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic-centric LLMs and applications while providing concrete recommendations for future efforts in Arabic post-training dataset development.

Bridging Dialectal Gaps in Arabic Medical LLMs through Model Merging
Ahmed Ibrahim | Abdullah Hosseini | Hoda Helmy | Wafa Lakhdhar | Ahmed Serag

The linguistic fragmentation of Arabic, with over 30 dialects exhibiting low mutual intelligibility, presents a critical challenge for deploying natural language processing (NLP) in healthcare. Conventional fine-tuning of large language models (LLMs) for each dialect is computationally prohibitive and operationally unsustainable. In this study, we explore model merging as a scalable alternative by integrating three pre-trained LLMs (a medical domain expert, an Egyptian Arabic model, and a Moroccan Darija model) into a unified system without additional fine-tuning. We introduce a novel evaluation framework that assesses dialectal fidelity via dual evaluation: LLM-based automated scoring and human assessments by native speakers. Our results demonstrate that the merged model effectively handles cross-dialect medical scenarios, such as interpreting Moroccan Darija inputs for Egyptian Arabic-speaking clinicians, while maintaining high clinical relevance. The merging process reduced computational cost by over 60% compared to per-dialect fine-tuning, highlighting its viability for resource-constrained settings. This work offers a promising path for building dialect-aware medical LLMs at scale, with implications for broader deployment across linguistically diverse regions.

pdf bib abs
Tool Calling for Arabic LLMs: Data Strategies and Instruction Tuning
Asım Ersoy | Enes Altinisik | Kareem Mohamed Darwish | Husrev Taha Sencar

Tool calling is a critical capability that allows Large Language Models (LLMs) to interact with external systems, significantly expanding their utility. However, research and resources for tool calling are predominantly English-centric, leaving a gap in our understanding of how to enable this functionality for other languages, such as Arabic. This paper investigates three key research questions: (1) the necessity of in-language (Arabic) tool-calling data versus relying on cross-lingual transfer, (2) the effect of general-purpose instruction tuning on tool-calling performance, and (3) the value of fine-tuning on specific, high-priority tools. To address these questions, we conduct extensive experiments using base and post-trained variants of an open-weight Arabic LLM. To enable this study, we bridge the resource gap by translating and adapting two open-source tool-calling datasets into Arabic. Our findings provide crucial insights into the optimal strategies for developing robust tool-augmented agents for Arabic.

pdf bib abs
Toward Culturally-Aware Arabic Debate Platforms with NLP Support
Khalid Al Khatib | Mohammad Khader

Despite the growing importance of online discourse, Arabic-speaking communities lack platforms that support structured, culturally grounded debate. Mainstream social media rarely fosters constructive engagement, often leading to polarization and superficial exchanges. This paper proposes the development of a culturally aware debate platform tailored to the values and traditions of Arabic-speaking users, with a focus on leveraging advances in natural language processing (NLP). We present findings from a user survey that explores experiences with existing debate tools and expectations for future platforms. In addition, we analyze 30,000 English-language debate topics using large language models (LLMs) to assess their cultural relevance and appropriateness for Arab audiences. We further examine the ability of LLMs to generate new culturally resonant debate topics, contributing to the emerging tasks of culture-aware topic assessment and generation. Finally, we propose a theoretical and technical framework for building an NLP-supported Arabic debate platform. Our work highlights the urgent need for culturally sensitive NLP resources that foster critical thinking, digital literacy, and meaningful deliberation in Arabic.

pdf bib abs
Modeling North African Dialects from Standard Languages
Yassine Toughrai | Kamel Smaïli | David Langlois

Processing North African Arabic dialects presents significant challenges due to high lexical variability, frequent code-switching with French, and the use of both Arabic and Latin scripts. We address this with a phoneme-based normalization strategy that maps Arabic and French text into a simplified representation (Arabic rendered in Latin script), reflecting native reading patterns. Using this method, we pretrain BERT-based models on normalized Modern Standard Arabic and French only and evaluate them on Named Entity Recognition (NER) and text classification. Experiments show that normalized standard-language corpora yield competitive performance on North African dialect tasks; in zero-shot NER, Ar_20k surpasses dialect-pretrained baselines. Normalization improves vocabulary alignment, indicating that normalized standard corpora can suffice for developing dialect-supportive

pdf bib abs
Learning Word Embeddings from Glosses: A Multi-Loss Framework for Arabic Reverse Dictionary Tasks
Engy Ibrahim | Farhah Adel | Marwan Torki | Nagwa El-Makky

We address the task of reverse dictionary modeling in Arabic, where the goal is to retrieve a target word given its definition. The task comprises two subtasks: (1) generating embeddings for Arabic words based on Arabic glosses, and (2) a cross-lingual setting where the gloss is in English and the target embedding is for the corresponding Arabic word. Prior approaches have largely relied on BERT models such as CAMeLBERT or MARBERT trained with mean squared error loss. In contrast, we propose a novel ensemble architecture that combines MARBERTv2 with the encoder of AraBART, and we demonstrate that the choice of loss function has a significant impact on performance. We apply contrastive loss to improve representational alignment, and introduce structural and center losses to better capture the semantic distribution of the dataset. This multi-loss framework enhances the quality of the learned embeddings and leads to consistent improvements in both monolingual and cross-lingual settings. Our system achieved the best rank metric in both subtasks compared to the previous approaches. These results highlight the effectiveness of combining architectural diversity with task-specific loss functions in representational tasks for morphologically rich languages like Arabic.

We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset’s utility for instruction tuning. Notably, we show that instruction tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.

pdf bib abs
Transfer or Translate? Argument Mining in Arabic with No Native Annotations
Sara Nabhani | Khalid Al Khatib

Argument mining for Arabic remains underexplored, largely due to the scarcity of annotated corpora. To address this gap, we examine the effectiveness of cross-lingual transfer from English. Using the English Persuasive Essays (PE) corpus, annotated with argumentative components (Major Claim, Claim, and Premise), we explore several transfer strategies: training encoder-based multilingual and monolingual models on English data, machine-translated Arabic data, and their combination. We further assess the impact of annotation noise introduced during translation by manually correcting portions of the projected training data. In addition, we investigate the potential of prompting large language models (LLMs) for the task. Experiments on a manually corrected Arabic test set show that monolingual models trained on translated data achieve the strongest performance, with further improvements from small-scale manual correction of training examples.

pdf bib abs
An Exploration of Knowledge Editing for Arabic
Basel Mousi | Nadir Durrani | Fahim Dalvi

While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training data for LTE to support future research.

We present Octopus, a first family of modular speech-language models designed for Arabic-English ASR, dialect identification, and speech translation. Built on Whisper-V3 and enhanced with large language models like ALLaM, LLaMA, and DeepSeek, Octopus bridges speech and text through a lightweight projection layer and Q-Former. To broaden its scope beyond speech, Octopus integrates BEATs, a general-purpose audio encoder, allowing it to understand both linguistic and acoustic events. Despite its simplicity, this dual-encoder design supports robust performance across multilingual and code-switched scenarios. We also introduce TinyOctopus, a distilled variant using smaller models (Distil-Whisper + LLaMA3-1B / DeepSeek-1.5B), achieving competitive results with just a fraction of the parameters. Fine-tuning on synthetic code-switched data further boosts its performance. Octopus demonstrates the power of compact, extensible architectures in Arabic-centric speech modeling and sets the stage for unified multilingual audio-language understanding.

pdf bib abs
ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training
Majd Hawasly | Tasnim Mohiuddin | Hamdy Mubarak | Sabri Boughorbel

The quality of training data plays a critical role in the performance of large language models (LLMs). This is especially true for low-resource languages where high-quality content is relatively scarce. Inspired by the success of FineWeb-Edu for English, we construct a native Arabic educational-quality dataset using similar methodological principles. We begin by sampling 1 million Arabic web documents from Common Crawl and labeling them into six quality classes (0–5) with the Qwen-2.5-72B-Instruct model using a classification prompt adapted from FineWeb-Edu. These labeled examples are used to train a robust classifier capable of distinguishing educational content from general web text. We train a classification head on top of a multilingual 300M encoder model, then use this classifier to filter a large Arabic web corpus, discarding documents with low educational value. To evaluate the impact of this curation, we pretrain from scratch two bilingual English-Arabic 7B LLMs on 800 billion tokens using the filtered and unfiltered data and compare their performance across a suite of benchmarks. Our results show a significant improvement when using the filtered educational dataset, validating the effectiveness of quality filtering as a component in a balanced data mixture for Arabic LLM development. This work addresses the scarcity of high-quality Arabic training data and offers a scalable methodology for curating educational quality content in low-resource languages.
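
The final filtering step of the pipeline above can be sketched as a simple threshold over classifier scores on the 0 to 5 scale. This is an illustration only: the cutoff of 3.0 is an assumption (the abstract does not state the threshold used), and `toy_score` is a hypothetical keyword-based stand-in for the trained classifier, not the authors' model.

```python
from typing import Callable, Iterable, Iterator


def filter_educational(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    min_score: float = 3.0,  # assumed cutoff on the 0-5 quality scale
) -> Iterator[str]:
    """Yield only documents whose educational-quality score meets the cutoff."""
    for doc in docs:
        if score_fn(doc) >= min_score:
            yield doc


def toy_score(doc: str) -> float:
    """Hypothetical scorer: 2 points per cue word found, capped at 5.

    A real pipeline would call the trained classification head here.
    """
    cues = ("theorem", "lesson", "definition", "exercise", "explain")
    hits = sum(1 for cue in cues if cue in doc.lower())
    return float(min(5, 2 * hits))


if __name__ == "__main__":
    corpus = [
        "Lesson 1: definition of a limit",
        "Buy cheap phones now",
    ]
    for kept in filter_educational(corpus, toy_score):
        print(kept)
```

Writing the filter as a generator keeps memory use flat, which matters when the input is a web-scale corpus streamed from disk.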

pdf bib abs
AMCrawl: An Arabic Web-Scale Dataset of Interleaved Image-Text Documents and Image-Text Pairs
Shahad Aboukozzana | Muhammad Kamran J Khan | Ahmed Ali

In this paper, we present the Arabic Multimodal Crawl (AMCrawl), the first native-based Arabic multimodal dataset to our knowledge, derived from the Common Crawl corpus and rigorously filtered for quality and safety. Image-text pair datasets are the standard choice for pretraining multimodal large language models. However, they are often derived from image alt-text metadata, which is typically brief and context-poor, disconnecting images from their broader meaning. Although significant advances have been made in building interleaved image-text datasets for English, such as the OBELICS dataset, a substantial gap remains for native Arabic content. Our processing covered 8.6 million Arabic web pages, yielding 5.8 million associated images and 1.3 billion text tokens. The final dataset includes interleaved image-text documents and question-answer pairs, featuring 2.8 million high-quality interleaved documents and 5 million QA pairs. Alongside the dataset, we release the complete pipeline and code, ensuring reproducibility and encouraging further research and development. To demonstrate the effectiveness of AMCrawl, we introduce a publicly available native Arabic Vision Language model, trained with 13 billion parameters. These models achieve competitive results when benchmarked against publicly available datasets. AMCrawl bridges a critical gap in Arabic multimodal resources, providing a robust foundation for developing Arabic multimodal large language models and fostering advancements in this underrepresented area. Code: github.com/shahad-aboukozzana/AMCrawl

pdf bib abs
DialG2P: Dialectal Grapheme-to-Phoneme. Arabic as a Case Study
Majd Hawasly | Hamdy Mubarak | Ahmed Abdelali | Ahmed Ali

Grapheme-to-phoneme (G2P) models are essential components in text-to-speech (TTS) and pronunciation assessment applications. While standard forms of languages have gained attention in that regard, dialectal speech, which often serves as the primary means of spoken communication for many communities, as is the case for Arabic, has not received the same level of focus. In this paper, we introduce an end-to-end dialectal G2P for Egyptian Arabic, a dialect without standard orthography. Our novel architecture accomplishes three tasks: (i) restores short vowels of the diacritical marks for the dialectal text; (ii) maps certain characters that happen only in the spoken version of the dialectal Arabic to their dialect-specific character transcriptions; and finally (iii) converts the previous step output to the corresponding phoneme sequence. We benchmark G2P on a modular cascaded system, a large language model, and our multi-task end-to-end architecture.

pdf bib abs
Shawarma Chats: A Benchmark Exact Dialogue & Evaluation Platter in Egyptian, Maghrebi & Modern Standard Arabic—A Triple-Dialect Feast for Hungry Language Models
Kamyar Zeinalipour | Mohamed Zaky Saad | Oumaima Attafi | Marco Maggini | Marco Gori

Content-grounded dialogue evaluation for Arabic remains under-resourced, particularly across Modern Standard (MSA), Egyptian, and Maghrebi varieties. We introduce Shawarma Chats, a benchmark of 30,000 six-turn conversations grounded in Wikipedia content, evenly split across the three dialects. To build this corpus, we prompt five frontier LLMs (GPT-4o, Gemini 2.5 Flash, Qwen-Plus, DeepSeek-Chat, and Mistral Large) to generate 1,500 seed dialogues. Native Arabic speakers evaluate these outputs to select the most effective generator and most human-aligned grader. Sub-A dialogues undergo a two-pass, rationale-driven self-repair loop where the grader critiques and the generator revises; unresolved cases are manually corrected. We apply this pipeline to 10,000 Wikipedia paragraphs to create 30,000 high-quality conversations (10,000 per dialect) at modest human cost. To validate the benchmark, we LoRA-fine-tune six open LLMs (1–24 B parameters) on Shawarma Chats and observe consistent gains in automatic-grader scores, BERTScore, BLEU, and ROUGE, particularly for models larger than 7 B parameters. Shawarma Chats thus establishes the first large-scale, dialect-aware, content-grounded dialogue benchmark for Arabic.
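
One possible reading of the two-pass self-repair loop is sketched below. The control flow is an interpretation of the abstract, not the authors' code: `grade_fn` and `revise_fn` are hypothetical callables standing in for the grader and generator LLMs, and the grade label "A" as the acceptance criterion is inferred from the phrase "Sub-A dialogues".

```python
from typing import Callable, Tuple

# grade_fn returns (grade, rationale); revise_fn rewrites a dialogue given a rationale.
GradeFn = Callable[[str], Tuple[str, str]]
ReviseFn = Callable[[str, str], str]


def self_repair(
    dialogue: str,
    grade_fn: GradeFn,
    revise_fn: ReviseFn,
    max_passes: int = 2,  # "two-pass" in the abstract
) -> Tuple[str, bool]:
    """Critique-and-revise until the grader awards an A or passes run out.

    Returns the latest dialogue and whether it was resolved; unresolved
    dialogues would be routed to manual correction.
    """
    for _ in range(max_passes):
        grade, rationale = grade_fn(dialogue)
        if grade == "A":
            return dialogue, True
        dialogue = revise_fn(dialogue, rationale)
    final_grade, _ = grade_fn(dialogue)
    return dialogue, final_grade == "A"


if __name__ == "__main__":
    # Toy stand-ins: a dialogue earns an A once it ends with "!!".
    def toy_grade(d: str) -> Tuple[str, str]:
        return ("A", "") if d.endswith("!!") else ("B", "add emphasis")

    def toy_revise(d: str, rationale: str) -> str:
        return d + "!"

    print(self_repair("hi", toy_grade, toy_revise))
```

Returning the resolved flag alongside the dialogue makes it easy to collect the unresolved cases into a queue for human annotators.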

pdf (full)
bib (full) Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present an overview of the AraGenEval shared task, organized as part of the ArabicNLP 2025 conference. This task introduced the first benchmark suite for Arabic authorship analysis, featuring three subtasks: Authorship Style Transfer, Authorship Identification, and AI-Generated Text Detection. We curated high-quality datasets, including over 47,000 paragraphs from 21 authors and a balanced corpus of human- and AI-generated texts. The task attracted significant global participation, with 72 registered teams from 16 countries. The results highlight the effectiveness of transformer-based models, with top systems leveraging prompt engineering for style transfer, model ensembling for authorship identification, and a mix of multilingual and Arabic-specific models for AI text detection. This paper details the task design, datasets, participant systems, and key findings, establishing a foundation for future research in Arabic stylistics and trustworthy NLP.

pdf bib
MISSION at AraGenEval Shared Task: Enhanced Arabic Authority Classification
Thamer Maseer Alharbi

pdf bib
Nojoom.AI at AraGenEval Shared Task: Arabic Authorship Style Transfer
Hafsa Kara Achira | Mourad Bouache | Mourad Dahmane

pdf bib
LMSA at AraGenEval Shared Task: Ensemble-Based Detection of AI-Generated Arabic Text Using Multilingual and Arabic-Specific Models
Kaoutar Zita | Attia Nehar | Abdelkader Khelil | Slimane Bellaouar | Hadda Cherroun

pdf bib
Amr&MohamedSabaa at AraGenEval shared task: Arabic Authorship Identification using Term Frequency – Inverse Document Frequency Features with Supervised Machine Learning
Amr Sabaa | Mohamed Sabaa

pdf bib
NLP_wizard at AraGenEval shared task: Embedding-Based Classification for AI Detection and Authorship Attribution
Mena Hany

pdf bib
PTUK-HULAT at AraGenEval Shared Task: Fine-tuning XLM-RoBERTa for AI-Generated Arabic News Detection
Tasneem Duridi | Areej Jaber | Paloma Martínez

pdf bib
Sebaweh at AraGenEval Shared Task: BERENSE - BERt based ENSEmbler for Arabic Authorship Identification
Muhammad Helmy | Batool Najeh Balah | Ahmed Mohamed Sallam | Ammar Sherif

pdf bib
CUET-NLP_Team_SS306 at AraGenEval Shared Task: A Transformer-based Framework for Detecting AI-Generated Arabic Text
Sowrav Nath | Shadman Saleh | Kawsar Ahmed | Mohammed Moshiul Hoque

pdf bib
BUSTED at ARATECT Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
Ali Zain | Sareem Farooqui | Muhammad Rafi

pdf bib
CIOL at AraGenEval shared task: Authorship Identification and AI Generated Text Detection in Arabic using Pretrained Models
Sadia Tasnim Meem | Azmine Toushik Wasi

pdf bib
Osint at AraGenEval shared task: Fine-Tuned Modeling for Tracking Style Signatures and AI Generation in Arabic Texts
Shifali Agrahari | Hemanth Prakash Simhadri | Ashutosh Kumar Verma | Ranbir Singh Sanasam

pdf bib
MarsadLab at AraGenEval Shared Task: LLM-Based Approaches to Arabic Authorship Style Transfer and Identification
Md. Rafiul Biswas | Mabrouka Bessghaier | Firoj Alam | Wajdi Zaghouani

pdf bib
REGLAT at AraGenEval shared task: Morphology-Aware AraBERT for Detecting Arabic AI-Generated Text
Mariam Labib | Nsrin Ashraf | Mohammed Aldawsari | Hamada Nayel

pdf bib
Jenin at AraGenEval Shared Task: Parameter-Efficient Fine-Tuning and Layer-Wise Analysis of Arabic LLMs for Authorship Style Transfer and Classification
Huthayfa Malhis | Mohammad Tami | Huthaifa I. Ashqar

We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 co-located with EMNLP 2025. This shared task addresses the paucity of high-quality Arabic medical QA resources by offering two complementary tracks: MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. Each track comprises multiple subtasks, evaluation datasets, and standardized metrics, facilitating fair benchmarking. The task was structured to promote modeling under realistic, multilingual, and culturally nuanced healthcare contexts. We outline the dataset creation, task design and evaluation framework, participation statistics, and baseline systems, and summarize the overall outcomes. We conclude with reflections on the performance trends observed and prospects for future iterations in Arabic health QA.

pdf bib
NYUAD at AraHealthQA Shared Task: Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks
Nouar AlDahoul | Yasir Zaki

pdf bib
MedLingua at MedArabiQ2025: Zero- and Few-Shot Prompting of Large Language Models for Arabic Medical QA
Fatimah Mohamed Emad Elden | Mumina Ab. Abukar

pdf bib
Sakinah-AI at MentalQA: A Comparative Study of Few-Shot, Optimized, and Ensemble Methods for Arabic Mental Health Question Classification
Fatimah Mohamed Emad Elden | Mumina Ab. Abukar

pdf bib
MindLLM at AraHealthQA 2025 Track 1: Leveraging Large Language Models for Mental Health Question Answering
Nejood Abdulaziz Bin Eshaq

pdf bib
Quasar at AraHealthQA Track 1 : Leveraging Zero-Shot Large Language Models for Question and Answer Categorization in Arabic Mental Health
Adiba Fairooz Chowdhury | Md Sagor Chowdhury

pdf bib
Binary_Bunch at AraHealthQA Track 1: Arabic Mental Health Q&A Classification Using Data Augmentation and Transformer Models
Sajib Bhattacharjee | Ratnajit Dhar | Kawsar Ahmed | Mohammed Moshiul Hoque

pdf bib
!MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning
Mohamed Younes | Seif Ahmed | Mohamed Basem

pdf bib
Sindbad at AraHealthQA Track 1: Leveraging Large Language Models for Mental Health Q&A
AbdulRahman A. Morsy | Saad Mankarious | Ayah Zirikly

pdf bib
Arabic Mental Health Question Answering: A Multi-Task Approach with Advanced Retrieval-Augmented Generation
Abdelaziz Amr AbdelAziz | Mohamed Ahmed Youssef | Mamdouh Mohamed Koritam | Marwa Eldeeb | Ensaf Hussein

pdf bib
AraMinds at AraHealthQA 2025: A Retrieval-Augmented Generation System for Fine-Grained Classification and Answer Generation of Arabic Mental Health Q&A
Mohamed Zaytoon | Ahmed Mahmoud Salem | Ahmed Sakr | Hossam Elkordi

pdf bib
Fahmni at AraHealthQA Track 1: Multi-Agent Retrieval-Augmented Generation and Multi-Label Classification for Arabic Mental Health Q&A
Caroline Sabty | Mohamad Rasmy | Mohamed Eyad Badran | Nourhan Sakr | Alia El Bolock

pdf bib
MedGapGab at AraHealthQA: Modular LLM Assignment for Gaps and Gabs in Arabic Medical Question Answering
Baraa Hikal

pdf bib
Egyhealth at General Arabic Health QA (MedArabiQ): An Enhanced RAG Framework with Large-Scale Arabic Q&A Medical Data
Hossam Amer | Rawan Tarek Taha | Gannat Elsayed | Ensaf Hussein Mohamed

pdf bib
mucAI at AraHealthQA 2025: Explain–Retrieve–Verify (ERV) Workflow for Multi-Label Arabic Health QA Classification
Ahmed Abdou

pdf bib
MarsadLab at AraHealthQA: Hybrid Contextual–Lexical Fusion with AraBERT for Question and Answer Categorization
Mabrouka Bessghaier | Shimaa Ibrahim | Md. Rafiul Biswas | Wajdi Zaghouani

pdf bib abs
BAREC Shared Task 2025 on Arabic Readability Assessment
Khalid N. Elmadani | Bashar Alhafni | Hanada Taha | Nizar Habash

We present the results and findings of the BAREC Shared Task 2025 on Arabic Readability Assessment, organized as part of The Third Arabic Natural Language Processing Conference (ArabicNLP 2025). The BAREC 2025 shared task focuses on automatic readability assessment using the BAREC Corpus, addressing fine-grained classification into 19 readability levels. The shared task includes two sub-tasks, sentence-level classification and document-level classification, and three tracks: (1) Strict Track, where only the BAREC Corpus is allowed; (2) Constrained Track, restricted to the BAREC Corpus, SAMER Corpus, and SAMER Lexicon; and (3) Open Track, allowing any external resources. A total of 22 teams from 12 countries registered for the task. Among these, 17 teams submitted system description papers. The winning team achieved 87.5 QWK on the sentence-level task and 87.4 QWK on the document-level task.
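
QWK here presumably denotes Quadratic Weighted Kappa, with the reported scores apparently given on a 0 to 100 scale. A minimal pure-Python sketch of the standard definition follows; the function name and the 0-based level encoding (levels 1 to 19 mapped to 0 to 18) are choices made for this illustration, and the official scorer may differ in detail.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    y_true: Sequence[int], y_pred: Sequence[int], num_levels: int
) -> float:
    """Quadratic weighted kappa for ordinal labels in range(num_levels).

    kappa = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), where
    w_ij = (i - j)^2 / (num_levels - 1)^2, O is the observed confusion
    matrix, and E is the matrix expected under independent marginals.
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred) or num_levels < 2:
        raise ValueError("need equal-length, non-empty inputs and >= 2 levels")

    observed = [[0] * num_levels for _ in range(num_levels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]

    numerator = 0.0
    denominator = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_true[i] * hist_pred[j] / n

    if denominator == 0.0:
        # Both raters used one and the same level throughout: perfect agreement.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    gold = [0, 5, 10, 18]
    pred = [0, 6, 9, 18]
    print(quadratic_weighted_kappa(gold, pred, num_levels=19))
```

Because the penalty grows with the square of the distance between levels, a prediction that is off by one level costs far less than one that is off by ten, which suits a fine-grained 19-level ordinal scale.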

pdf bib
Syntaxa at BAREC Shared Task 2025: BERTnParse - Fusion of BERT and Dependency Graphs for Readability Prediction
Ahmed Bahloul

pdf bib
GNNinjas at BAREC Shared Task 2025: Lexicon-Enriched Graph Modeling for Arabic Document Readability Prediction
Passant Elchafei | Mayar Osama | Mohamad Rageh | Mervat Abu-Elkheir

pdf bib
ZAI at BAREC Shared Task 2025: AraBERT CORAL for Fine Grained Arabic Readability
Ahmad M. Nazzal

pdf bib
ANLPers at BAREC Shared Task 2025: Readability of Embeddings Training Neural Readability Classifiers on the BAREC Corpus
Serry Sibaee | Omer Nacar | Yasser Alhabashi | Adel Ammar | Wadii Boulila

pdf bib
MarsadLab at BAREC Shared Task 2025: Strict-Track Readability Prediction with Specialized AraBERT Models on BAREC
Shimaa Ibrahim | Md. Rafiul Biswas | Mabrouka Bessghaier | Wajdi Zaghouani

pdf bib
SATLab at BAREC Shared Task 2025: Optimizing a Language-Independent System for Fine-Grained Readability Assessment
Yves Bestgen

pdf bib
MorphoArabia at BAREC Shared Task 2025: A Hybrid Architecture with Morphological Analysis for Arabic Readability Assessment
Fatimah Mohamed Emad Elden

pdf bib
!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
Mohamed Basem | Mohamed Younes | Seif Ahmed | Abdelrahman Moustafa

pdf bib
Qais at BAREC Shared Task 2025: A Fine-Grained Approach for Arabic Readability Classification Using a pre-trained model
Samar Ahmad

pdf bib
mucAI at BAREC Shared Task 2025: Towards Uncertainty Aware Arabic Readability Assessment
Ahmed Abdou

pdf bib
AMAR at BAREC Shared Task 2025: Arabic Meta-learner for Assessing Readability
Mostafa Saeed | Rana Waly | Abdelaziz Ashraf Hussein

pdf bib
Noor at BAREC Shared Task 2025: A Hybrid Transformer-Feature Architecture for Sentence-level Readability Assessment
Nour Rabih

pdf bib
PalNLP at BAREC Shared Task 2025: Predicting Arabic Readability Using Ordinal Regression and K-Fold Ensemble Learning
Mutaz Ayesh

pdf bib
Pixels at BAREC Shared Task 2025: Visual Arabic Readability Assessment
Ben Sapirstein

pdf bib
Phantoms at BAREC Shared Task 2025: Enhancing Arabic Readability Prediction with Hybrid BERT and Linguistic Features
Ahmed Alhassan | Asim Mohamed | Moayad Elamin

pdf bib
STBW at BAREC Shared Task 2025: AraBERT-v2 with MSE-SoftQWK Loss for Sentence-Level Arabic Readability
Saoussan Trigui

pdf bib
LIS at BAREC Shared Task 2025: Multi-Scale Curriculum Learning for Arabic Sentence-Level Readability Assessment Using Pre-trained Language Models
Anya Amel Nait Djoudi | Patrice Bellot | Adrian-Gabriel Chifu

We present ImageEval 2025, the first shared task dedicated to Arabic image captioning. The task addresses the critical gap in multimodal Arabic NLP by focusing on two complementary subtasks: (1) creating the first open-source, manually-captioned Arabic image dataset through a collaborative datathon, and (2) developing and evaluating Arabic image captioning models. A total of 44 teams registered, of which eight submitted during the test phase, producing 111 valid submissions. Evaluation was conducted using automatic metrics, LLM-based judgment, and human assessment. In Subtask 1, the best-performing system achieved a cosine similarity of 65.5, while in Subtask 2, the top score was 60.0. Although these results show encouraging progress, they also confirm that Arabic image captioning remains a challenging task, particularly due to cultural grounding requirements, morphological richness, and dialectal variation. All datasets, baseline models, and evaluation tools are released publicly to support future research in Arabic multimodal NLP.

pdf bib
Codezone Research Group at ImageEval Shared-Task 2: Arabic Image Captioning Using BLIP and M2M100: A Two-Stage Translation Approach for ImageEval 2025
Abdulkadir Shehu Bichi

pdf bib
BZU-AUM@ImageEval2025: An Arabic Image Captioning Dataset for Conflict Narratives with Human Annotation
Mohammed Alkhanafseh | Ola Surakhi | Abdallah Abedaljalill

pdf bib
ImpactAi at ImageEval 2025 Shared Task: Region-Aware Transformers for Arabic Image Captioning – A Case Study on the Palestinian Narrative
Rabee Al-Qasem | Mohannad Hendi

pdf bib
VLCAP at ImageEval 2025 Shared Task: Multimodal Arabic Captioning with Interpretable Visual Concept Integration
Passant Elchafei | Amany Fashwan

pdf bib
PhantomTroupe at ImageEval 2025 Shared Task: Multimodal Arabic Image Captioning through Translation-Based Fine-Tuning of LLM Models
Muhammad Abu Horaira | Farhan Amin | Sakibul Hasan | Md. Tanvir Ahammed Shawon | Muhammad Ibrahim Khan

pdf bib
NU_Internship team at ImageEval 2025: From Zero-Shot to Ensembles: Enhancing Grounded Arabic Image Captioning
Rana Gaber | Seif Eldin Amgad | Ahmed Sherif Nasri | Mohamed Ibrahim Ragab | Ensaf Hussein Mohamed

pdf bib
Averroes at ImageEval 2025 Shared Task: Advancing Arabic Image Captioning with Augmentation and Two-Stage Generation
Mariam Saeed | Sarah Elshabrawy | Abdelrahman Hagrass | Mazen Yasser | Ayman Khalafallah

pdf bib
AZLU at ImagEval Shared Task: Bridging Linguistics and Cultural Gaps in Arabic Image Captioning
Sarah Yassine

We present the findings of the first shared task on Qur’anic pronunciation assessment, which focuses on addressing the unique challenges of evaluating the precise pronunciation of Qur’anic recitation. To fill an existing research gap, the Iqra’Eval 2025 shared task introduces the first open benchmark for Mispronunciation Detection and Diagnosis (MDD) in Qur’anic recitation, using Modern Standard Arabic (MSA) reading of Qur’anic texts as its case study. The task provides a comprehensive evaluation framework with increasingly complex subtasks: error localization and detailed error diagnosis. Leveraging the recently developed QuranMB benchmark dataset along with auxiliary training resources, this shared task aims to stimulate research in an area of both linguistic and cultural significance while addressing computational challenges in pronunciation assessment.

pdf bib
Hafs2Vec: A System for the IqraEval Arabic and Qur’anic Phoneme-level Pronunciation Assessment
Ahmed Ibrahim

pdf bib
Phoneme-level mispronunciation detection in Quranic recitation using ShallowTransformer
Mohamed Nadhir Daoud | Mohamed Anouar Ben Messaoud

pdf bib
AraS2P: Arabic Speech-to-Phonemes System
Bassam Mattar | Mohamed Fayed | Ayman Khalafallah

pdf bib
Metapseud at Iqra’Eval: Domain Adaptation with Multi-Stage Fine-Tuning for Phoneme-Level Qur’anic Mispronunciation Detection
Ayman Mansour

Hallucination in Large Language Models (LLMs) remains a significant challenge and continues to draw substantial research attention. The problem becomes especially critical when hallucinations arise in sensitive domains, such as religious discourse. To address this gap, we introduce IslamicEval 2025—the first shared task specifically focused on evaluating and detecting hallucinations in Islamic content. The task consists of two subtasks: (1) Hallucination Detection and Correction of quoted verses (Ayahs) from the Holy Quran and quoted Hadiths; and (2) Qur’an and Hadith Question Answering, which assesses retrieval models and LLMs by requiring answers to be retrieved from grounded, authoritative sources. Thirteen teams participated in the final phase of the shared task, employing a range of pipelines and frameworks. Their diverse approaches underscore both the complexity of the task and the importance of effectively managing hallucinations in Islamic discourse.

pdf bib
NUR at IslamicEval 2025 Shared Task: Retrieval-Augmented LLMs for Qur’an and Hadith QA
Serag Amin | Ranwa Aly | Yara Allam | Yomna Eid | Ensaf Hussein Mohamed

pdf bib
BurhanAI at IslamicEval 2025 Shared Task: Combating Hallucinations in LLMs for Islamic Content; Evaluation, Correction, and Retrieval-Based Solution
Arij Al Adel | Abu Bakr Soliman | Mohamed Sakher Sawan | Rahaf Al-Najjar | Sameh Amin

pdf bib
HUMAIN at IslamicEval 2025 Shared Task 1: A Three-Stage LLM-Based Pipeline for Detecting and Correcting Hallucinations in Quran and Hadith
Arwa Omayrah | Sakhar Alkhereyf | Ahmed Abdelali | Abdulmohsen Al-Thubaity | Jeril Kuriakose | Ibrahim AbdulMajeed

pdf bib
TCE at IslamicEval 2025: Retrieval-Augmented LLMs for Quranic and Hadith Content Identification and Verification
Mohammed ElKoumy | Khalid Allam | Ahmad Tamer | Mohamed Alqablawi

pdf bib
ThinkDrill at IslamicEval 2025 Shared Task: LLM Hybrid Approach for Qur’an and Hadith Question Answering
Eman Elrefai | Toka Khaled | Ahmed Soliman

pdf bib
Burhan at IslamicEval: Fact-Augmented and LLM-Driven Retrieval for Islamic QA
Mohammad Basheer | Watheq Mansour | Abdulhamid Touma | Ahmad Qadeib Alban

pdf bib
Isnad AI at IslamicEval 2025: A Rule-Based System for Identifying Religious Texts in LLM Outputs
Fatimah Mohamed Emad Elden

This paper presents the MAHED 2025 Shared Task on Multimodal Detection of Hope and Hate Emotions in Arabic Content, comprising three subtasks: (1) text-based classification of Arabic content into hate and hope, (2) multi-task learning for joint prediction of emotions, offensive content, and hate speech, and (3) multimodal detection of hateful content in Arabic memes. We provide three high-quality datasets totaling over 22,000 instances sourced from social media platforms, annotated by native Arabic speakers with Cohen’s Kappa exceeding 0.85. Our evaluation attracted 46 leaderboard submissions from participants, with systems leveraging Arabic-specific pre-trained language models (AraBERT, MARBERT), large language models (GPT-4, Gemini), and multimodal fusion architectures combining CLIP vision encoders with Arabic text models. The best-performing systems achieved macro F1-scores of 0.723 (Task 1), 0.578 (Task 2), and 0.796 (Task 3), with top teams employing ensemble methods, class-weighted training, and OCR-aware multimodal fusion. Our analysis reveals persistent challenges in dialectal robustness and in minority-class detection for hope speech, and highlights key directions for future Arabic content moderation research.

pdf bib
NYUAD at MAHED Shared Task: Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models
Nouar AlDahoul | Yasir Zaki

pdf bib
NguyenTriet at MAHED Shared Task: Ensemble of Arabic BERT Models with Hierarchical Prediction and Soft Voting for Text-Based Hope and Hate Detection
Nguyen Minh Triet | Thìn Đặng Văn

pdf bib
ANLPers at MAHED2025: From Hate to Hope: Boosting Arabic Text Classification
Yasser Alhabashi | Serry Sibaee | Omer Nacar | Adel Ammar | Wadii Boulila

pdf bib
LoveHeaven at MAHED 2025: Text-based Hate and Hope Speech Classification Using AraBERT-Twitter Ensemble
Nguyễn Thiên Bảo | Dang Van Thin

pdf bib
CIC-NLP at MAHED 2025 TASK 1: Assessing the Role of Bigram Augmentation in Multiclass Arabic Hate and Hope Speech Classification
Tolulope Olalekan Abiola | Oluwatobi Joseph Abiola | Ogunleye Temitope Dasola | Tewodros Achamaleh | Obiadoh Augustine Ekenedilichukwu

pdf bib
TranTranUIT at MAHED Shared Task: Multilingual Transformer Ensemble with Advanced Data Augmentation and Optuna-based Hyperparameter Optimization
Trinh Tran Tran | Thìn Đặng Văn

pdf bib
YassirEA at MAHED 2025: Fusion-Based Multimodal Models for Arabic Hate Meme Detection
Yassir El Attar

pdf bib
AAA at MAHED Text-based Hate and Hope Speech Classification: A Systematic Encoder Evaluation for Arabic Hope and Hate Speech Classification
Ahmed Khalil Elzainy | Mohamed Amin | Ahmed Samir | Hazem Abdelsalam

pdf bib
CUET-823 at MAHED 2025 Shared Task: Large Language Model-Based Framework for Emotion, Offensive, and Hate Detection in Arabic
Ratnajit Dhar | Arpita Mallik

pdf bib
AraMinds at MAHED 2025: Leveraging Vision-Language Models and Contrastive Multi-task Learning for Multimodal Hate Speech Detection
Mohamed Zaytoon | Ahmed Mahmoud Salem | Ahmed Sakr | Hossam Elkordi

pdf bib
DesCartes-HOPE at MAHED Shared task 2025: Integrating Pragmatic Features for Arabic Hope and Hate Speech Detection
Leila Moudjari | Hacène-Cherkaski Mélissa | Farah Benamara

pdf bib
ANLP-UniSo at MAHED Shared Task: Detection of Hate and Hope Speech in Arabic Social Media based on XLM-RoBERTa and Logistic Regression
Yasmine El Abed | Mariem Ben Arbia | Saoussen Ben Chaabene | Omar Trigui

pdf bib
REGLAT at MAHED Shared Task: A Hybrid Ensemble-Based System for Arabic Hate Speech Detection
Nsrin Ashraf | Mariam Labib | Tarek Elshishtawy | Hamada Nayel

pdf bib
HTU at MAHED Shared Task: Ensemble-Based Classification of Arabic Hate and Hope Speech Using Pre-trained Dialectal Arabic Models
Abdallah Saleh | Mariam M Biltawi

pdf bib
SmolLab_SEU at MAHED Shared Task: Do Arabic-Native Encoders Surpass Multilingual Models in Detecting the Nuances of Hope, Hate, and Emotion?
Md Abdur Rahman | Md Sabbir Dewan | Md. Tofael Ahmed Bhuiyan | Md Ashiqur Rahman

pdf bib
Baoflowin502 at MAHED Shared Task: Text-based Hate and Hope Speech Classification
Nguyen Minh Bao | Dang Van Thin

pdf bib
AyahVerse at MAHED Shared Task: Fine-Tuning ArabicBERT with Preprocessing for Hope and Hate Detection
Ibad-ur-Rehman Rashid | Muhammad Hashir Khalil

pdf bib
MultiMinds at MAHED 2025: Multimodal and Multitask Approaches for Detecting Emotional, Hate, and Offensive Speech in Arabic Content
Riddhiman Debnath | Abdul Wadud Shakib | Md Saiful Islam

pdf bib
joy_2004114 at MAHED Shared Task : Filtering Hate Speech from Memes using A Multimodal Fusion-based Approach
Joy Das | Alamgir Hossain | Mohammed Moshiul Hoque

pdf bib
Quasar at MAHED Shared Task : Decoding Emotions and Offense in Arabic Text using LLM and Transformer-Based Approaches
Md Sagor Chowdhury | Adiba Fairooz Chowdhury

pdf bib
CUET_Zahra_Duo@Mahed 2025: Hate and Hope Speech Detection in Arabic Social Media Content using Transformer
Walisa Alam | Mehreen Rahman | Shawly Ahsan | Mohammed Moshiul Hoque

pdf bib
AraNLP at MAHED 2025 Shared Task: Using AraBERT for Text-based Hate and Hope Speech Classification
Wafaa S. El-Kassas | Enas A. Hakim Khalil

pdf bib
Thinking Nodes at MAHED: A Comparative Study of Multimodal Architectures for Arabic Hateful Meme Detection
Itbaan Safwan

We present the findings of the sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task, which focused on Arabic speech dialect processing across three subtasks: spoken dialect identification (Subtask 1), speech recognition (Subtask 2), and diacritic restoration for spoken dialects (Subtask 3). A total of 44 teams registered, and during the testing phase, 100 valid submissions were received from eight unique teams. The distribution was as follows: 34 submissions for Subtask 1 from five teams, 47 submissions for Subtask 2 from six teams, and 19 submissions for Subtask 3 from two teams. The best-performing systems achieved 79.8% accuracy on Subtask 1, 35.68/12.20 WER/CER (overall average) on Subtask 2, and 55/13 WER/CER on Subtask 3. These results highlight the ongoing challenges of Arabic dialect speech processing, particularly in dialect identification, recognition, and diacritic restoration. We also summarize the methods adopted by participating teams and briefly outline directions for future editions of NADI.

pdf bib
Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning
Mahmoud Salhab | Shameed Sait | Mohammad Abusheikh | Hasan Abusheikh

pdf bib
Lahjati at NADI 2025 A ECAPA-WavLM Fusion with Multi-Stage Optimization
Sanad Albawwab | Omar Qawasmeh

pdf bib
MarsadLab at NADI Shared Task: Arabic Dialect Identification and Speech Recognition using ECAPA-TDNN and Whisper
Md. Rafiul Biswas | Kais Attia | Shimaa Ibrahim | Mabrouka Bessghaier | Wajdi Zaghouani

pdf bib
ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks
Haroun Elleuch | Youssef Saidi | Salima Mdhaffar | Yannick Estève | Fethi Bougares

pdf bib
Unicorn at NADI 2025 Subtask 3: GEMM3N-DR: Audio-Text Diacritic Restoration via Fine-tuning Multimodal Arabic LLM
Mohamed Lotfy Elrefai

Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent. Ultimately, this benchmark provides a crucial, standardized framework to guide the development of more culturally grounded and competent Arabic LLMs. Results of the shared task demonstrate that general cultural and general religious knowledge remain challenging to LLMs, motivating us to continue to offer the shared task in the future.

pdf bib
Hamyaria at PalmX2025: Leveraging Large Language Models to Improve Arabic Multiple-Choice Questions in Cultural and Islamic Domains
Walid Al-Dhabyani | Hamzah A. Alsayadi

pdf bib
ISL-NLP at PalmX 2025: Retrieval-Augmented Fine-Tuning for Arabic Cultural Question Answering
Mohamed Gomaa | Noureldin Elmadany

pdf bib
ADAPT–MTU HAI at PalmX 2025: Leveraging Full and Parameter‐Efficient LLM Fine‐Tuning for Arabic Cultural QA
Shehenaz Hossain | Haithem Afli

pdf bib
CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation
Hunzalah Hassan Bhatti | Youssef Ahmed | Md Arid Hasan | Firoj Alam

pdf bib
MarsadLab at PalmX Shared Task: An LLM Benchmark for Arabic Culture and Islamic Civilization
Md. Rafiul Biswas | Shimaa Ibrahim | Kais Attia | Firoj Alam | Wajdi Zaghouani

pdf bib
Star at PalmX 2025: Arabic Cultural Understanding via Targeted Pretraining and Lightweight Fine-tuning
Eman Elrefai | Esraa Khaled | Alhassan Ehab

pdf bib
AYA at PalmX 2025: Modeling Cultural and Islamic Knowledge in LLMs
Jannatul Tajrin | Bir Ballav Roy | Firoj Alam

pdf bib
Cultura-Arabica: Probing and Enhancing Arabic Cultural Awareness in Large Language Models via LoRA
Pulkit Chatwal | Santosh Kumar Mishra

pdf bib
Phoenix at Palmx: Exploring Data Augmentation for Arabic Cultural Question Answering
Houdaifa Atou | Issam Ait Yahia | Ismail Berrada

This paper provides a comprehensive overview of the QIAS 2025 shared task, organized as part of the ArabicNLP 2025 conference and co-located with EMNLP 2025. The task was designed for the evaluation of large language models in the complex domains of religious and legal reasoning. It comprises two subtasks: (1) Islamic Inheritance Reasoning, requiring models to compute inheritance shares according to Islamic jurisprudence, and (2) Islamic Knowledge Assessment, which covers a range of traditional Islamic disciplines. Both subtasks were structured as multiple-choice question answering challenges, with questions stratified by difficulty level. The shared task attracted significant interest, with 44 teams participating in the development phase, from which 18 teams advanced to the final test phase. Of these, six teams submitted entries for both subtasks, eight for Subtask 1 only, and two for Subtask 2 only. Ultimately, 16 teams submitted system description papers. Herein, we detail the task’s motivation, dataset construction, and evaluation protocol, and present a summary of the participating systems and their results.

pdf bib
NYUAD at QIAS Shared Task: Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases
Nouar AlDahoul | Yasir Zaki

pdf bib
SHA at the QIAS Shared Task: LLMs for Arabic Islamic Inheritance Reasoning
Shatha Altammami

pdf bib
N&N at QIAS 2025: Chain-of-Thought Ensembles with Retrieval-Augmented framework for Classical Arabic Islamic
Nourah Alangari | Nouf AlShenaifi

pdf bib
HIAST at QIAS 2025: Retrieval-Augmented LLMs with Top-Hit Web Evidence for Arabic Islamic Reasoning QA
Mohamed Motasim Hamed | Nada Ghneim | Riad Sonbol

pdf bib
QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning
Mohammad AL-Smadi

pdf bib
Transformer Tafsir at QIAS 2025 Shared Task: Hybrid Retrieval-Augmented Generation for Islamic Knowledge Question Answering
Muhammad Abu Ahmad | Mohamad Ballout | Raia Abu Ahmad | Elia Bruni

pdf bib
PuxAI at QIAS 2025: Multi-Agent Retrieval-Augmented Generation for Islamic Inheritance and Knowledge Reasoning
Nguyen Xuan Phuc | Thìn Đặng Văn

pdf bib
Athar at QIAS2025: LLM-based Question Answering Systems for Islamic Inheritance and Classical Islamic Knowledge
Yossra Noureldien | Hassan Suliman | Farah Attallah | Abdelrazig Mohamed | Sara Abdalla

pdf bib
ADAPT–MTU HAI at QIAS2025: Dual-Expert LLM Fine-Tuning and Constrained Decoding for Arabic Islamic Inheritance Reasoning
Shehenaz Hossain | Haithem Afli

pdf bib
CVPD at QIAS 2025 Shared Task: An Efficient Encoder-Based Approach for Islamic Inheritance Reasoning
Salah Eddine Bekhouche | Abdellah Zakaria Sellam | Telli Hichem | Cosimo Distante | Abdenour Hadid

pdf bib
CIS-RG at QIAS 2025 Shared Task: Approaches for Enhancing Performance of LLM on Islamic Legal Reasoning and its Mathematical Calculations
Osama Farouk Zaki

pdf bib
SEA-Team at QIAS 2025: Enhancing LLMs for Question Answering in Islamic Texts
Sanaa Alowaidi

pdf bib
MorAI at QIAS 2025: Collaborative LLM via Voting and Retrieval-Augmented Generation for Solving Complex Inheritance Problems
Jihad R’baiti | Chouaib El Hachimi | Youssef Hmamouche | Amal Seghrouchni

pdf bib
Gumball at QIAS 2025: Arabic LLM Automated Reasoning in Islamic Inheritance
Eman Elrefai | Mohamed Lotfy Elrefai | Aml Hassan Esmail

pdf bib
Tokenizers United at QIAS-2025: RAG-Enhanced Question Answering for Islamic Studies by Integrating Semantic Retrieval with Generative Reasoning
Mayar Boghdady

Automated Essay Scoring (AES) has emerged as a significant research problem in natural language processing, offering valuable tools to support educators in assessing student writing. Motivated by the growing need for reliable Arabic AES systems, we organized the first shared Task for Arabic Quality Evaluation of Essays in Multi-dimensions (TAQEEM) held at the ArabicNLP 2025 conference. TAQEEM 2025 includes two subtasks: Task A on holistic scoring and Task B on trait-specific scoring. It introduces a new (and first of its kind) dataset of 1,265 Arabic essays, annotated with holistic and trait-specific scores, including relevance, organization, vocabulary, style, development, mechanics, and grammar. The main goal of TAQEEM is to address the scarcity of standardized benchmarks and high-quality resources in Arabic AES. TAQEEM 2025 attracted 11 registered teams for Task A and 10 for Task B, with a total of 5 teams, across both tasks, submitting system runs for evaluation. This paper presents an overview of the task, outlines the approaches employed, and discusses the results of the participating teams.

pdf bib
912 at TAQEEM 2025: A Distribution-aware Approach to Arabic Essay Scoring
Trong-Tai Dam Vu | Thìn Đặng Văn

pdf bib
Taibah at TAQEEM 2025: Leveraging GPT-4o for Arabic Essay Scoring
Nada Almarwani | Alaa Alharbi | Samah Aloufi

pdf bib
MarsadLab at TAQEEM 2025: Prompt-Aware Lexicon-Enhanced Transformer for Arabic Automated Essay Scoring
Mabrouka Bessghaier | Md. Rafiul Biswas | Amira Dhouib | Wajdi Zaghouani