Jiamin Su
2026
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection
Yibo Yan | Shen Wang | Jiahao Huo | Hang Li | Boyan Li | Jiamin Su | Xiong Gao | YiFan Zhang | Tianlong Xu | Zhendong Chu | Aoxiao Zhong | Kun Wang | Hui Xiong | Philip S. Yu | Xuming Hu | Qingsong Wen
Findings of the Association for Computational Linguistics: ACL 2026
As Multimodal Large Language Models (MLLMs) continue to evolve, they show promise for mathematical reasoning tasks because, unlike text-only LLMs, they can handle multimodal questions through cross-modal understanding. Current mathematical benchmarks predominantly focus on evaluating MLLMs’ problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, which is needed to enhance reasoning capability in complicated settings. To fill this gap, we formally formulate a new task, multimodal error detection, and introduce **ErrorRadar**, the **first benchmark designed to assess MLLMs’ capabilities in this task**. ErrorRadar evaluates two sub-tasks, **error step identification and error categorization**, providing a framework for evaluating MLLMs’ complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions in an educational organization, with expert-based annotation and metadata such as problem type and error category. Through extensive experiments, we evaluated representative open-source and closed-source MLLMs, benchmarking their performance against educational expert evaluators. Results indicate that challenges remain: GPT-4o, the best-performing model, is still around 10% behind human evaluation.
2025
EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
Jiamin Su | Yibo Yan | Fangteng Fu | Zhang Han | Jingheng Ye | Xiang Liu | Jiahao Huo | Huiyu Zhou | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025
Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (i) reliance on handcrafted features that limit generalizability, (ii) difficulty in capturing fine-grained traits like coherence and argumentation, and (iii) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose **EssayJudge**, the **first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits**. By leveraging MLLMs’ strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.
A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
Yibo Yan | Jiamin Su | Jianxiang He | Fangteng Fu | Xu Zheng | Yuanhuiyi Lyu | Kun Wang | Shen Wang | Qingsong Wen | Xuming Hu
Findings of the Association for Computational Linguistics: ACL 2025
Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides **the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs)**. We review over 200 studies published since 2021 and examine state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into future directions for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.
PhysicsArena: The First Multimodal Physics Reasoning Benchmark Exploring Variable, Process, and Solution Dimensions
Song Dai | Yibo Yan | Jiamin Su | Zihao Dongfang | Yubo Gao | Yonghua Hei | Jungang Li | Junyan Zhang | Sicheng Tao | Zhuoran Gao | Xuming Hu
Findings of the Association for Computational Linguistics: EMNLP 2025
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in diverse reasoning tasks, yet their application to complex physics reasoning remains underexplored. Physics reasoning presents unique challenges, requiring grounding in physical conditions and the interpretation of multimodal information. Current physics benchmarks are limited, often focusing on text-only inputs or solely on problem-solving, thereby overlooking the critical intermediate steps of variable identification and process formulation. To address these limitations, we introduce **PhysicsArena, the first multimodal physics reasoning benchmark designed to holistically evaluate MLLMs across three critical dimensions: variable identification, physical process formulation, and solution derivation.** PhysicsArena aims to provide a comprehensive platform for assessing and advancing the multimodal physics reasoning abilities of MLLMs.
Co-authors
- Xuming Hu 4
- Yibo Yan 4
- Fangteng Fu 2
- Jiahao Huo 2
- Shen Wang 2
- Kun Wang 2
- Qingsong Wen 2
- Zhendong Chu 1
- Song Dai 1
- Zihao Dongfang 1
- Xiong Gao 1
- Yubo Gao 1
- Zhuoran Gao 1
- Zhang Han 1
- Jianxiang He 1
- Yonghua Hei 1
- Hang Li 1
- Boyan Li 1
- Jungang Li 1
- Xiang Liu 1
- Yuanhuiyi Lyu 1
- Sicheng Tao 1
- Hui Xiong 1
- Tianlong Xu 1
- Jingheng Ye 1
- Philip S. Yu 1
- Yifan Zhang 1
- Junyan Zhang 1
- Xu Zheng 1
- Aoxiao Zhong 1
- Huiyu Zhou 1