Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models
Gaurav Srivastava, Aafiya Shamshad Hussain, Sriram Srinivasan, Xuan Wang
Abstract
Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present **LLMThinkBench**, a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. **First,** we formalize the *accuracy-verbosity tradeoff*. **Second,** we introduce the *Overthinking Score*, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation. **Third,** we establish an evaluation protocol with dynamically-generated data across **14** basic math tasks. **Fourth,** we conduct a large-scale empirical study evaluating **53** LLMs, including reasoning and quantized variants across different reasoning budgets. **Fifth,** we release **LLMThinkBench** as an open-source Python package and public leaderboard for reproducibility. Our findings reveal: **1)** model performance on complex benchmarks does not translate directly to basic math reasoning; **2)** reasoning models generate **∼18× more tokens** while sometimes achieving **lower accuracy** and exhibit catastrophic collapse when tokens are constrained, dropping by up to **∼36%**; **3)** the accuracy-verbosity relationship is non-monotonic with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from **low → medium → high** reasoning effort). *Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning.* Our public leaderboard is available at https://ctrl-gaurav.github.io/LLMThinkBench/. Our open-source Python package is available at https://pypi.org/project/llmthinkbench/, and the codebase can be found at https://github.com/ctrl-gaurav/LLMThinkBench for easy and reproducible evaluation.
- Anthology ID:
- 2026.findings-acl.1285
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 25784–25826
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1285/
- Cite (ACL):
- Gaurav Srivastava, Aafiya Shamshad Hussain, Sriram Srinivasan, and Xuan Wang. 2026. Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25784–25826, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models (Srivastava et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1285.pdf
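
The abstract describes the Overthinking Score as a harmonic mean of accuracy and token efficiency but does not give the exact formula. The sketch below illustrates only the general idea. The `token_efficiency` normalization, the `token_budget` parameter, and both function names are hypothetical choices made for illustration; they are not taken from the paper or from the `llmthinkbench` package.

```python
def token_efficiency(avg_tokens: float, token_budget: float) -> float:
    """Hypothetical normalization (an assumption, not the paper's definition):
    1.0 when no tokens are used, falling linearly to 0.0 at or above the budget."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    used = min(max(avg_tokens, 0.0), token_budget)
    return 1.0 - used / token_budget


def harmonic_mean_score(accuracy: float, efficiency: float) -> float:
    """Harmonic mean of two values in [0, 1]; returns 0.0 if both are zero."""
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= efficiency <= 1.0):
        raise ValueError("accuracy and efficiency must lie in [0, 1]")
    if accuracy + efficiency == 0.0:
        return 0.0
    return 2.0 * accuracy * efficiency / (accuracy + efficiency)


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    concise = harmonic_mean_score(0.90, token_efficiency(200, 4000))
    verbose = harmonic_mean_score(0.90, token_efficiency(3600, 4000))
    print(f"concise model: {concise:.3f}")
    print(f"verbose model: {verbose:.3f}")
```

Because a harmonic mean is pulled toward the smaller of its two inputs, a model with high accuracy but very low token efficiency still scores poorly under this kind of metric, which is the behavior the abstract attributes to verbose reasoning models.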