ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Yibo Yan; Shen Wang; Jiahao Huo; Hang Li; Boyan Li; Jiamin Su; Xiong Gao; Yifan Zhang; Tianlong Xu; Zhendong Chu; Aoxiao Zhong; Kun Wang; Hui Xiong; Philip S. Yu; Xuming Hu; Qingsong Wen

Abstract

As Multimodal Large Language Models (MLLMs) continue to evolve, they show promise for mathematical reasoning tasks: unlike text-only LLMs, they can handle multimodal questions through cross-modal understanding. Current mathematical benchmarks predominantly evaluate MLLMs' problem-solving ability, leaving a crucial gap in more complex scenarios such as error detection, which matter for strengthening reasoning in complicated settings. To fill this gap, we formulate a new task, multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in this task. ErrorRadar evaluates two sub-tasks, error step identification and error categorization, providing a framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with expert-based annotation and metadata such as problem type and error category. Through extensive experiments, we evaluate representative open-source and closed-source MLLMs and benchmark their performance against educational expert evaluators. Results indicate that challenges remain: GPT-4o, the best-performing model, still trails human evaluation by around 10%.
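As a rough illustration of the two sub-tasks named in the abstract, the sketch below scores predictions for error step identification and error categorization separately. The `Annotation` fields, the category labels, the toy data, and the choice of plain accuracy as the score are all assumptions made for illustration; the abstract does not specify the benchmark's schema or metric.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence


@dataclass(frozen=True)
class Annotation:
    """One annotated (or predicted) error for a student solution.

    Both fields are hypothetical names, not the benchmark's actual schema.
    """
    error_step: int       # index of the first incorrect step
    error_category: str   # placeholder label for the kind of error


def accuracy(
    gold: Sequence[Annotation],
    pred: Sequence[Annotation],
    key: Callable[[Annotation], Hashable],
) -> float:
    """Fraction of items where the prediction matches gold under `key`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    hits = sum(key(g) == key(p) for g, p in zip(gold, pred))
    return hits / len(gold)


# Toy data with made-up category labels, for illustration only.
gold = [
    Annotation(2, "calculation"),
    Annotation(1, "reasoning"),
    Annotation(3, "visual"),
    Annotation(2, "calculation"),
]
pred = [
    Annotation(2, "calculation"),
    Annotation(1, "calculation"),
    Annotation(4, "visual"),
    Annotation(2, "reasoning"),
]

# Sub-task 1: error step identification.
step_acc = accuracy(gold, pred, key=lambda a: a.error_step)
# Sub-task 2: error categorization.
cat_acc = accuracy(gold, pred, key=lambda a: a.error_category)

print(f"step identification accuracy: {step_acc:.2f}")
print(f"error categorization accuracy: {cat_acc:.2f}")
```

Scoring the two sub-tasks independently keeps a correct step prediction from being penalized by a wrong category, and vice versa.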

Anthology ID: 2026.findings-acl.1217
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 24297–24334
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1217/
Cite (ACL): Yibo Yan, Shen Wang, Jiahao Huo, Hang Li, Boyan Li, Jiamin Su, Xiong Gao, YiFan Zhang, Tianlong Xu, Zhendong Chu, Aoxiao Zhong, Kun Wang, Hui Xiong, Philip S. Yu, Xuming Hu, and Qingsong Wen. 2026. ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection. In Findings of the Association for Computational Linguistics: ACL 2026, pages 24297–24334, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection (Yan et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1217.pdf
Checklist: 2026.findings-acl.1217.checklist.pdf
