Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic

Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan


Abstract
Large language models (LLMs) achieve impressive results on advanced mathematics benchmarks but sometimes fail on basic arithmetic tasks, raising the question of whether they have truly grasped fundamental arithmetic rules or are merely relying on pattern matching. To unravel this issue, we systematically probe LLMs’ understanding of two-integer addition (0 to 2^64) by testing three crucial properties: commutativity (A+B=B+A), representation invariance via symbolic remapping (e.g., 7 ↦ Y), and consistent accuracy scaling with operand length. Our evaluation of 12 leading LLMs reveals a stark disconnect: while models achieve high numeric accuracy (73.8–99.8%), they systematically fail these diagnostics. Specifically, accuracy plummets to ≤ 7.5% with symbolic inputs, commutativity is violated in up to 20% of cases, and accuracy scaling is non-monotonic. Interventions further expose this pattern-matching reliance: explicitly providing rules degrades performance by 29.49%, while prompting for explanations before answering merely maintains baseline accuracy. These findings demonstrate that current LLMs address elementary addition via pattern matching, not robust rule induction, motivating new diagnostic benchmarks and innovations in model architecture and training to cultivate genuine mathematical reasoning. Our dataset and generating code are available at https://github.com/kuri-leo/llm-arithmetic-diagnostic.
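The commutativity and symbolic-remapping diagnostics named in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: every function name, the symbol alphabet, and the stand-in "solvers" below are assumptions made for illustration, and a real evaluation would replace the solvers with calls to an LLM.

```python
import random

DIGITS = "0123456789"


def make_symbol_map(symbols="ABCDEFGHIJ", seed=0):
    """Build a hypothetical one-to-one digit-to-symbol mapping.

    The alphabet and seed are illustrative assumptions, not the paper's setup.
    """
    rng = random.Random(seed)
    shuffled = list(symbols)
    rng.shuffle(shuffled)
    return dict(zip(DIGITS, shuffled))


def encode(n, mapping):
    """Rewrite a non-negative integer digit by digit using the mapping."""
    return "".join(mapping[d] for d in str(n))


def decode(s, mapping):
    """Invert encode(): map each symbol back to its digit."""
    inverse = {v: k for k, v in mapping.items()}
    return int("".join(inverse[c] for c in s))


def sample_pairs(n, max_value, seed=0):
    """Draw n operand pairs uniformly from [0, max_value]."""
    rng = random.Random(seed)
    return [(rng.randint(0, max_value), rng.randint(0, max_value))
            for _ in range(n)]


def commutativity_violation_rate(solver, pairs):
    """Fraction of pairs for which solver(a, b) != solver(b, a)."""
    violations = sum(1 for a, b in pairs if solver(a, b) != solver(b, a))
    return violations / len(pairs)


def symbolic_accuracy(symbolic_solver, pairs, mapping):
    """Fraction of pairs answered correctly when inputs and outputs are remapped."""
    correct = 0
    for a, b in pairs:
        predicted = symbolic_solver(encode(a, mapping), encode(b, mapping))
        correct += predicted == encode(a + b, mapping)
    return correct / len(pairs)


def exact_solver(a, b):
    """Stand-in for a model that has truly learned addition."""
    return a + b


def order_sensitive_solver(a, b):
    """Stand-in for a flawed model: off by one whenever a < b."""
    return a + b if a >= b else a + b + 1
```

A solver that applies the addition rule scores 0.0 on `commutativity_violation_rate` and 1.0 on `symbolic_accuracy`, since the rule depends neither on operand order nor on how digits are written. The order-sensitive stand-in fails the first check on every pair with distinct operands, which is the kind of inconsistency the diagnostic is meant to surface.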
Anthology ID:
2025.emnlp-main.681
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13478–13494
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.681/
Cite (ACL):
Yang Yan, Yu Lu, Renjun Xu, and Zhenzhong Lan. 2025. Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13478–13494, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic (Yan et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.681.pdf
Checklist:
2025.emnlp-main.681.checklist.pdf