Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics

Zena Al Khalili; Nick Howell; Dietrich Klakow

Abstract

Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs' generated programs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.

Anthology ID: 2025.gem-1.64
Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month: July
Year: 2025
Address: Vienna, Austria and virtual meeting
Editors: Ofir Arviv, Miruna Clinciu, Kaustubh Dhole, Rotem Dror, Sebastian Gehrmann, Eliya Habba, Itay Itzhak, Simon Mille, Yotam Perlitz, Enrico Santus, João Sedoc, Michal Shmueli Scheuer, Gabriel Stanovsky, Oyvind Tafjord
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 741–758
URL: https://preview.aclanthology.org/nschneid-patch-1/2025.gem-1.64/
Cite (ACL): Zena Al Khalili, Nick Howell, and Dietrich Klakow. 2025. Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 741–758, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal): Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics (Khalili et al., GEM 2025)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2025.gem-1.64.pdf
