Gagan Bhatia
2026
Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA
Ummar Abbas | Mourad Ouzzani | Mohamed Y. Eltabakh | Omar Sinan | Gagan Bhatia | Hamdy Mubarak | Majd Hawasly | Mohammed Qusay Hashim | Kareem Mohamed Darwish | Firoj Alam
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur’an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) improves grounding; however, a single retrieve-then-generate pipeline is insufficient for diverse Islamic queries, including verbatim scripture, citation-grounded guidance, and rule-constrained computations such as zakat and inheritance. To address these challenges, we present Fanar-Sadiq, a bilingual Arabic-English Islamic QA system built on a multi-agent, tool-augmented architecture. It is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic queries to specialized modules within an agentic tool architecture. It supports intent-aware routing, retrieval-grounded fiqh answers with normalized citations and verification traces, exact verse lookup with quotation validation, and deterministic Sunni zakat and inheritance calculators with madhhab-sensitive branching. We evaluate the end-to-end system on public Islamic QA benchmarks and show strong effectiveness and efficiency. It is publicly accessible through an API and Web application and has received over 1.9M accesses in less than a year (https://api.fanar.qa/docs).
2025
DateLogicQA: Benchmarking Temporal Biases in Large Language Models
Gagan Bhatia | MingZe Tang | Cristina Mahanta | Madiha Kazi | Maxime Peyrard | Wei Zhao
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
We introduce DateLogicQA, a human-curated benchmark of 190 questions specifically designed to understand temporal bias in Large Language Models (LLMs). Covering seven date formats across past, present, and future contexts, DateLogicQA examines four reasoning types: commonsense, factual, conceptual, and numerical. Through human-led evaluations of 12 state-of-the-art LLMs, we identify Representation-Level Bias, arising from suboptimal embeddings that distort date semantics, and Logical-Level Bias, manifesting when correct date tokens yield flawed temporal reasoning. Our findings underscore persistent challenges in handling various date formats and temporal contexts, revealing the need for more robust pretraining data, targeted post-training methods, and precise tokenization strategies. By illuminating these biases, we provide actionable insights to guide the development of LLMs for accurate temporal reasoning across diverse real-world applications.
Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks
Gagan Bhatia | El Moatez Billah Nagoudi | Abdellah El Mekki | Fakhraddin Alwajih | Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: NAACL 2025
In this paper, we introduce Swan, a family of embedding models centred around the Arabic language, addressing both small-scale and large-scale use cases. Swan includes two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model. To evaluate these models, we propose ArabicMTEB, a comprehensive benchmark suite that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance, covering eight diverse tasks and spanning 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks, while Swan-Small consistently surpasses Multilingual-E5-base. Our extensive evaluations demonstrate that Swan models are dialectally and culturally aware, excelling across various Arabic domains while offering significant monetary efficiency. This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic natural language processing. Our models and benchmarks will be made publicly accessible for research.
Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Gagan Bhatia | Maxime Peyrard | Wei Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Modern BPE tokenisers often split calendar dates into meaningless fragments, e.g., “20250312” → “202”, “503”, “12”, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokeniser preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction heals date fragments. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).
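The fragmentation idea in the abstract above can be illustrated with a small sketch. The abstract does not give the formula for the date fragmentation ratio, so the definition used here (the share of date components that do not survive as a single, boundary-aligned token) is an assumption made for illustration only, and the function names `spans` and `fragmentation_ratio` are hypothetical. The example reuses the split quoted in the abstract, “20250312” → “202”, “503”, “12”.

```python
def spans(pieces):
    """Return (start, end) character offsets of each piece in the joined string."""
    out, pos = [], 0
    for piece in pieces:
        out.append((pos, pos + len(piece)))
        pos += len(piece)
    return out


def fragmentation_ratio(components, tokens):
    """Toy ratio (an assumption, not the paper's formula): the fraction of
    date components that are NOT preserved as exactly one token."""
    if "".join(components) != "".join(tokens):
        raise ValueError("components and tokens must spell the same string")
    token_spans = set(spans(tokens))
    broken = sum(1 for span in spans(components) if span not in token_spans)
    return broken / len(components)


# Year, month, day as a human would segment "20250312".
components = ["2025", "03", "12"]

# The fragmented split quoted in the abstract: only the day survives intact.
fragmented = fragmentation_ratio(components, ["202", "503", "12"])

# A tokeniser that keeps every component whole scores zero.
preserved = fragmentation_ratio(components, ["2025", "03", "12"])
```

Under this toy definition, the split from the abstract breaks the year and the month but keeps the day, so two of three components count as fragmented.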
2024
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Gagan Bhatia | El Moatez Billah Nagoudi | Hasan Cavusoglu | Muhammad Abdul-Mageed
Findings of the Association for Computational Linguistics: ACL 2024
We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs) built upon the Mistral-7b model and tailored for financial analysis. FinTral integrates textual, numerical, tabular, and image data. We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training by exploiting a large collection of textual and visual datasets we curate for this work. We also introduce an extensive benchmark featuring nine tasks and 25 datasets for evaluation, including hallucinations in the financial domain. Our FinTral model trained with direct preference optimization employing advanced Tools and Retrieval methods, dubbed FinTral-DPO-T&R, demonstrates an exceptional zero-shot performance. It outperforms ChatGPT-3.5 in all tasks and surpasses GPT-4 in five out of nine tasks, marking a significant advancement in AI-driven financial technology. We also demonstrate that FinTral has the potential to excel in real-time analysis and decision-making in diverse financial contexts.
Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic
Fakhraddin Alwajih | Gagan Bhatia | Muhammad Abdul-Mageed
Proceedings of the Second Arabic Natural Language Processing Conference
Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed ***Dallah***, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. ***Dallah*** demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning on six Arabic dialects, ***Dallah*** showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, ***Dallah*** has the potential to pave the way for further development of dialect-aware Arabic MLLMs.
Qalam: A Multimodal LLM for Arabic Optical Character and Handwriting Recognition
Gagan Bhatia | El Moatez Billah Nagoudi | Fakhraddin Alwajih | Muhammad Abdul-Mageed
Proceedings of the Second Arabic Natural Language Processing Conference
Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces ***Qalam***, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train ***Qalam*** on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, ***Qalam*** demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore ***Qalam***’s potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency.
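For readers unfamiliar with the Word Error Rate (WER) figures quoted in the abstract above, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. It is not the paper's evaluation script; whitespace tokenisation and the function name `word_error_rate` are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.
    Words are split on whitespace (an assumption for this sketch)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
example = word_error_rate("a b c d", "a x c")
```

A WER of 0.80% therefore corresponds to fewer than one word-level edit per hundred reference words.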
Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks
Fakhraddin Alwajih | El Moatez Billah Nagoudi | Gagan Bhatia | Abdelrahman Mohamed | Muhammad Abdul-Mageed
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have proven effective in a wide range of tasks that require complex reasoning and linguistic comprehension. However, due to a lack of high-quality multimodal resources in languages other than English, the success of MLLMs remains relatively limited to English-based settings. This poses significant challenges in developing comparable models for other languages, even those with large speaker populations, such as Arabic. To alleviate this challenge, we introduce a comprehensive family of Arabic MLLMs, dubbed *Peacock*, with strong vision and language capabilities. Through comprehensive qualitative and quantitative analysis, we demonstrate the solid performance of our models on various visual reasoning tasks and further show their emerging dialectal potential. Additionally, we introduce *Henna*, a new benchmark specifically designed for assessing MLLMs on aspects related to Arabic culture, setting the first stone for culturally-aware Arabic MLLMs. The GitHub repository for the *Peacock* project is available at [https://github.com/UBC-NLP/peacock](https://github.com/UBC-NLP/peacock).
2023
SIDLR: Slot and Intent Detection Models for Low-Resource Language Varieties
Sang Yun Kwon | Gagan Bhatia | Elmoatez Billah Nagoudi | Alcides Alcoba Inciarte | Muhammad Abdul-mageed
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Intent detection and slot filling are two critical tasks in spoken and natural language understanding for task-oriented dialog systems. In this work, we describe our participation in slot and intent detection for low-resource language varieties (SID4LR) (Aepli et al., 2023). We investigate the slot and intent detection (SID) tasks using a wide range of models and settings. Given the recent success of multitask prompted finetuning of large language models, we also test the generalization capability of the recent encoder-decoder model mT0 (Muennighoff et al., 2022) on new tasks (i.e., SID) in languages it has never intentionally seen. We show that our best model outperforms the baseline by a large margin (up to +30 F1 points) in both SID tasks.
UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis
Gagan Bhatia | Ife Adebara | Abdelrahim Elmadany | Muhammad Abdul-mageed
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We describe our contribution to the SemEval 2023 AfriSenti-SemEval shared task, where we tackle the task of sentiment analysis in 14 different African languages. We develop both monolingual and multilingual models under a fully supervised setting (subtasks A and B). We also develop models for the zero-shot setting (subtask C). Our approach involves experimenting with transfer learning using six language models, including further pretraining of some of these models as well as a final finetuning stage. Our best performing models achieve an F1-score of 70.36 on development data and an F1-score of 66.13 on test data. Unsurprisingly, our results demonstrate the effectiveness of transfer learning and finetuning techniques for sentiment analysis across multiple languages. Our approach can be applied to other sentiment analysis tasks in different languages and domains.
Beyond English: Evaluating LLMs for Arabic Grammatical Error Correction
Sang Kwon | Gagan Bhatia | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed
Proceedings of ArabicNLP 2023
Large language models (LLMs) finetuned to follow human instruction have recently exhibited significant capabilities in various English NLP tasks. However, their performance in grammatical error correction (GEC), especially on languages other than English, remains largely unexplored. In this work, we evaluate the abilities of instruction finetuned LLMs in Arabic GEC, a complex task due to Arabic’s rich morphology. Our findings suggest that various prompting methods, coupled with (in-context) few-shot learning, demonstrate considerable effectiveness, with GPT-4 achieving up to 65.49 F1 score under expert prompting (approximately 5 points higher than our established baseline). Despite these positive results, we find that instruction finetuned models, regardless of their size, are still outperformed by fully finetuned ones, even if they are significantly smaller in size. This disparity highlights substantial room for improvement for LLMs. Inspired by methods used in low-resource machine translation, we also develop a method exploiting synthetic data that significantly outperforms previous models on two standard Arabic benchmarks. Our best model achieves a new SOTA on Arabic GEC, with 73.29 and 73.26 F1 on the 2014 and 2015 QALB datasets, respectively, compared to peer-reviewed published baselines.
Co-authors
- Muhammad Abdul-Mageed 8
- El-Moatez-Billah Nagoudi 5
- Fakhraddin Alwajih 4
- Maxime Peyrard 2
- Wei Zhao 2
- Ummar Abbas 1
- Ife Adebara 1
- Firoj Alam 1
- Alcides Alcoba Inciarte 1
- Hasan Cavusoglu 1
- Kareem Mohamed Darwish 1
- Abdellah El Mekki 1
- Abdelrahim Elmadany 1
- Mohamed Y. Eltabakh 1
- Mohammed Qusay Hashim 1
- Majd Hawasly 1
- Madiha Kazi 1
- Sang Kwon 1
- Sang Yun Kwon 1
- Cristina Mahanta 1
- Abdelrahman Mohamed 1
- Hamdy Mubarak 1
- Elmoatez Billah Nagoudi 1
- Mourad Ouzzani 1
- Omar Sinan 1
- MingZe Tang 1