From A and B to A+B: Can Large Language Models Solve Compositional Math Problems?

Xisheng Xiao, Hanlin Zhao


Abstract
Large language models (LLMs) have demonstrated strong performance in solving math problems, and there is growing research on evaluating their robustness. Unlike previous studies that create problem variants by perturbing a single problem, this paper focuses on the interaction between problems. Specifically, we combine two original problems through a logical connection into a new math problem and measure LLMs' performance on it to evaluate their compositional generalization, an essential reasoning capability of human intelligence. Experiments covering 14 different LLMs show that even when the mathematical essence remains unchanged, a simple form of combination can significantly reduce LLM performance, revealing the limits of their generalization ability. Additionally, we propose an automated pipeline with 98.2% accuracy to assist in annotating datasets (1 manual, 2 synthetic). Extensive experiments on these datasets further verify this conclusion and yield several important findings. Finally, we analyze the impact of factors such as difficulty and length on LLMs' performance, offering insights for future research.
Anthology ID:
2025.emnlp-main.660
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13068–13089
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.660/
Cite (ACL):
Xisheng Xiao and Hanlin Zhao. 2025. From A and B to A+B: Can Large Language Models Solve Compositional Math Problems?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13068–13089, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
From A and B to A+B: Can Large Language Models Solve Compositional Math Problems? (Xiao & Zhao, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.660.pdf
Checklist:
 2025.emnlp-main.660.checklist.pdf