BenNumEval: A Benchmark to Assess LLMs’ Numerical Reasoning Capabilities in Bengali

Kawsar Ahmed, Md Osama, Omar Sharif, Eftekhar Hossain, Mohammed Moshiul Hoque


Abstract
Large Language Models (LLMs) demonstrate exceptional proficiency in general-purpose tasks but struggle with numerical reasoning, particularly in low-resource languages like Bengali. Despite advancements, limited research has explored their numerical reasoning capabilities in these languages. To address this gap, we present BenNumEval (Bengali Numerical Evaluation), a benchmark designed to assess LLMs on numerical reasoning tasks in Bengali. It comprises six diverse tasks and a total of 3.2k samples curated from real-world problem-solving scenarios. Our extensive evaluations reveal that even with advanced prompting techniques such as Cross-Lingual Prompting (XLP) and Cross-Lingual Chain-of-Thought Prompting (XCoT), LLMs fall notably short of human-level performance, particularly when using Bengali Native Prompting (BNaP). These findings underscore the substantial gap between current LLM capabilities and human expertise in numerical reasoning, highlighting the need for more robust and linguistically inclusive AI models to advance Bengali Language Processing and equitable AI development. The source code for the system and evaluation pipeline is publicly available on GitHub.
Anthology ID:
2025.findings-acl.915
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
17782–17799
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.915/
Cite (ACL):
Kawsar Ahmed, Md Osama, Omar Sharif, Eftekhar Hossain, and Mohammed Moshiul Hoque. 2025. BenNumEval: A Benchmark to Assess LLMs’ Numerical Reasoning Capabilities in Bengali. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17782–17799, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
BenNumEval: A Benchmark to Assess LLMs’ Numerical Reasoning Capabilities in Bengali (Ahmed et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.915.pdf