Rania Elbadry
2026
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
Zhuohan Xie | Daniil Orel | Rushil Thareja | Dhruv Sahnan | Hachem Madmoun | Fan Zhang | Debopriyo Banerjee | Georgi Nenkov Georgiev | Xueqing Peng | Lingfei Qian | Jimin Huang | Jinyan Su | Aaryamonvikram Singh | Rui Xing | Rania Elbadry | Chen Xu | Haonan Li | Fajri Koto | Ivan Koychev | Tanmoy Chakraborty | Yuxia Wang | Salem Lahlou | Veselin Stoyanov | Sophia Ananiadou | Preslav Nakov
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose ChainEval, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.
SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry | Sarfraz Ahmad | Ahmed Heakl | Dani Bouch | Momina Ahsan | Muhra AlMahri | Marwa Elsaid Khalil | Yuxia Wang | Salem Lahlou | Sophia Ananiadou | Veselin Stoyanov | Jimin Huang | Xueqing Peng | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event–cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event–cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
FinReporting: An Agentic Workflow for Localized Reporting of Cross-Jurisdiction Financial Disclosure
Fan Zhang | Mingzi Song | Rania Elbadry | Yankai Chen | Shaobo Wang | Yixi Zhou | Xunwen Zheng | Yueru He | Yuyang Dai | Georgi Nenkov Georgiev | Ayesha Gull | Muhammad Usman Safder | Fan Wu | Liyuan Meng | Fengxian Ji | Junning Zhao | Xueqing Peng | Jimin Huang | YU Chen | Xue Liu | Preslav Nakov | Zhuohan Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Financial reporting systems increasingly leverage Large Language Models (LLMs) to extract and summarize corporate disclosures. However, most existing approaches assume a single-market setting and overlook structural differences across jurisdictions. Variations in accounting taxonomies, tagging infrastructures (e.g., XBRL vs. PDF), and aggregation conventions introduce substantial challenges for semantic alignment and reliable verification. Here, we aim to bridge this gap. We present FinReporting, an agentic workflow for localized cross-jurisdiction financial reporting. The system constructs a unified canonical ontology spanning the income statement, balance sheet, and cash flow statement, and decomposes reporting into auditable stages, including filing acquisition, extraction, canonical mapping, and anomaly logging. Rather than treating LLMs as free-form generators, FinReporting employs them as constrained verifiers operating under explicit decision rules with evidence grounding. Evaluated on annual filings from the USA, Japan, and China, FinReporting improves consistency and reliability under heterogeneous reporting regimes. We further release an interactive demo that enables cross-market inspection and supports structured export of localized financial statements. Our demo is available at https://huggingface.co/spaces/BoomQ/FinReporting-Demo. A video describing our system is available at https://www.youtube.com/watch?v=f65jdEL31Kk.
2025
Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World
Saeed Almheiri | Rania Elbadry | Mena Attia | Chenxi Wang | Preslav Nakov | Timothy Baldwin | Fajri Koto
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning within the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning (ICL) and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average in multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesian and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Ahmed Heakl | Muhammad Abdullah Sohail | Mukul Ranjan | Rania Elbadry | Ghazi Shazan Ahmad | Mohamed El-Geish | Omar Maher | Zhiqiang Shen | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 subdomains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (such as EasyOCR, PaddleOCR, and Surya) by an average of 60% in character error rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model, Gemini-2.0-Flash, achieves only 65% accuracy. This underscores the challenges of accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
Co-authors
- Preslav Nakov 4
- Jimin Huang 3
- Xueqing Peng 3
- Zhuohan Xie 3
- Sophia Ananiadou 2
- Georgi Nenkov Georgiev 2
- Ahmed Heakl 2
- Fajri Koto 2
- Salem Lahlou 2
- Veselin Stoyanov 2
- Yuxia Wang 2
- Fan Zhang 2
- Ghazi Shazan Ahmad 1
- Sarfraz Ahmad 1
- Momina Ahsan 1
- Muhra AlMahri 1
- Saeed Almheiri 1
- Mena Attia 1
- Timothy Baldwin 1
- Debopriyo Banerjee 1
- Dani Bouch 1
- Tanmoy Chakraborty 1
- YU Chen (陈昱) 1
- Yankai Chen 1
- Yuyang Dai 1
- Mohamed El-Geish 1
- Ayesha Gull 1
- Yueru He 1
- Fengxian Ji 1
- Marwa Elsaid Khalil 1
- Fahad Shahbaz Khan 1
- Salman Khan 1
- Ivan Koychev 1
- Haonan Li 1
- Xue Liu 1
- Hachem Madmoun 1
- Omar Maher 1
- Liyuan Meng 1
- Daniil Orel 1
- Lingfei Qian 1
- Mukul Ranjan 1
- Muhammad Usman Safder 1
- Dhruv Sahnan 1
- Zhiqiang Shen 1
- Aaryamonvikram Singh 1
- Muhammad Abdullah Sohail 1
- Mingzi Song 1
- Jinyan Su 1
- Rushil Thareja 1
- Chenxi Wang 1
- Shaobo Wang 1
- Fan Wu 1
- Rui Xing 1
- Chen Xu 1
- Junning Zhao 1
- Xunwen Zheng 1
- Yixi Zhou 1