Tianlong Xu
2026
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Hang Li | Boyan Li | Jiamin Su | Xiong Gao | YiFan Zhang | Tianlong Xu | Zhendong Chu | Aoxiao Zhong | Kun Wang | Hui Xiong | Philip S. Yu | Xuming Hu | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026
As Multimodal Large Language Models (MLLMs) continue to evolve, they show promise for mathematical reasoning tasks because, unlike text-only LLMs, they can handle multimodal questions through cross-modal understanding. Current mathematical benchmarks predominantly evaluate MLLMs’ problem-solving ability, leaving a crucial gap in more complex scenarios such as error detection, which matter for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate a new task, multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs’ capabilities in this task. ErrorRadar evaluates two sub-tasks, error step identification and error categorization, providing a framework for evaluating MLLMs’ complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with expert-based annotation and metadata such as problem type and error category. Through extensive experiments, we evaluate representative open-source and closed-source MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that challenges remain: GPT-4o, the best-performing model, is still around 10% behind human evaluation.
2025
Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
Hang Li | Tianlong Xu | Kaiqi Yang | Yucheng Chu | Yanling Chen | Yichi Song | Qingsong Wen | Hui Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.