Rania Elbadry
2026
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
Instruction-Guided Poetry Generation in Arabic and Its Dialects
Abdelrahman Sadallah | Kareem Elozeiri | Mervat Abassy | Rania Elbadry | Mohamed Anwar | Abed Alhakim Freihat | Preslav Nakov | Fajri Koto
Findings of the Association for Computational Linguistics: ACL 2026
Poetry has long been a central art form for Arabic speakers, serving as a powerful medium of expression and cultural identity. While modern Arabic speakers continue to value poetry, existing research on Arabic poetry within Large Language Models (LLMs) has primarily focused on analysis tasks such as interpretation or metadata prediction, e.g., rhyme schemes and titles. In contrast, our work addresses the practical aspect of poetry creation in Arabic by introducing controllable generation capabilities to assist users in writing poetry. Specifically, we present a large-scale, carefully curated instruction-based dataset in Modern Standard Arabic (MSA) and various Arabic dialects. This dataset enables tasks such as writing, revising, and continuing poems based on predefined criteria, including style and rhyme, as well as performing poetry analysis. Our experiments show that fine-tuning LLMs on this dataset yields models that can effectively generate poetry that is aligned with user requirements, based on both automated metrics and human evaluation with native Arabic speakers. The data and the code are available at https://github.com/mbzuai-nlp/instructpoet-ar
SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry | Sarfraz Ahmad | Ahmed Heakl | Dani Bouch | Momina Ahsan | Muhra AlMahri | Marwa Elsaid Khalil | Yuxia Wang | Salem Lahlou | Sophia Ananiadou | Veselin Stoyanov | Jimin Huang | Xueqing Peng | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event–cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event–cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
2025
Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World
Saeed Almheiri | Rania Elbadry | Mena Attia | Chenxi Wang | Preslav Nakov | Timothy Baldwin | Fajri Koto
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning within the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning (ICL) and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average in multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesian and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Ahmed Heakl | Muhammad Abdullah Sohail | Mukul Ranjan | Rania Elbadry | Ghazi Shazan Ahmad | Mohamed El-Geish | Omar Maher | Zhiqiang Shen | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 subdomains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (such as EasyOCR, PaddleOCR, and Surya) by an average of 60% in the character error rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges of accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
Co-authors
- Preslav Nakov 4
- Fajri Koto 3
- Sophia Ananiadou 2
- Ahmed Heakl 2
- Jimin Huang 2
- Salem Lahlou 2
- Xueqing Peng 2
- Veselin Stoyanov 2
- Yuxia Wang 2
- Zhuohan Xie 2
- Mervat Abassy 1
- Ghazi Shazan Ahmad 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Muhra AlMahri 1
- Saeed Almheiri 1
- Mohamed Anwar 1
- Mena Attia 1
- Timothy Baldwin 1
- Debopriyo Banerjee 1
- Dani Bouch 1
- Tanmoy Chakraborty 1
- Mohamed El-Geish 1
- Kareem Elozeiri 1
- Abed Alhakim Freihat 1
- Georgi Nenkov Georgiev 1
- Marwa Elsaid Khalil 1
- Fahad Shahbaz Khan 1
- Salman Khan 1
- Ivan Koychev 1
- Haonan Li 1
- Hachem Madmoun 1
- Omar Maher 1
- Daniil Orel 1
- Lingfei Qian 1
- Mukul Ranjan 1
- Abdelrahman Sadallah 1
- Dhruv Sahnan 1
- Zhiqiang Shen 1
- Aaryamonvikram Singh 1
- Muhammad Abdullah Sohail 1
- Jinyan Su 1
- Rushil Thareja 1
- Chenxi Wang 1
- Rui Xing 1
- Chen Xu 1
- Fan Zhang 1