Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, Fuli Feng


Abstract
The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking the dual perspective of the examiner regarding error identification and correction. From the examiner perspective, we define four evaluation tasks for error identification and correction along with a new dataset with annotated error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all models, while the open-source model LLaMA-2-7B demonstrates comparable abilities to the closed-source models GPT-3.5 and Gemini Pro. Notably, calculation error proves the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. These results reveal potential directions for developing the mathematical reasoning abilities of LLMs. Our code and dataset are available at https://github.com/LittleCirc1e/EIC.
Anthology ID:
2024.findings-acl.673
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11316–11360
URL:
https://aclanthology.org/2024.findings-acl.673
DOI:
10.18653/v1/2024.findings-acl.673
Cite (ACL):
Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, and Fuli Feng. 2024. Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction. In Findings of the Association for Computational Linguistics ACL 2024, pages 11316–11360, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction (Li et al., Findings 2024)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2024.findings-acl.673.pdf