We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, YiFan Zhang, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Xiao Zong, Yida Xu, Peiqing Yang, Zhimin Bao, Muxi Diao, Chen Li, Honggang Zhang


Abstract
Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks focus mainly on end-to-end performance but neglect the underlying principles of knowledge acquisition and generalization. In contrast, we introduce WE-MATH, the first benchmark specifically designed to explore these problem-solving principles. We meticulously collect 6.5K visual math problems and decompose them into 10.9K step-level questions for evaluation, spanning 5 layers of knowledge granularity and 67 hierarchical knowledge concepts. Specifically, we decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric to hierarchically assess inherent issues in LMMs' reasoning processes. With WE-MATH, we conduct a thorough evaluation of existing LMMs on visual mathematical reasoning and provide comprehensive analysis and insights for future development. We anticipate that WE-MATH will open new pathways for advances in visual mathematical reasoning for LMMs. Data and code are available at https://github.com/We-Math/We-Math.
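
The abstract describes the four-dimensional metric only at a high level. As a rough illustrative sketch (not code from the We-Math repository), the diagnostic for a two-step composite problem might be computed as below, using the paper's four category labels (IK, IG, CM, RM); the `TwoStepResult` type, its field names, and the exact decision rule are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TwoStepResult:
    composite_correct: bool   # model solved the original composite problem
    step1_correct: bool       # model solved the sub-problem for knowledge concept 1
    step2_correct: bool       # model solved the sub-problem for knowledge concept 2

def classify(r: TwoStepResult) -> str:
    """Map one evaluation outcome onto the four diagnostic categories:
    Insufficient Knowledge (IK), Inadequate Generalization (IG),
    Complete Mastery (CM), and Rote Memorization (RM)."""
    steps_ok = r.step1_correct and r.step2_correct
    if r.composite_correct:
        # Solving the composite problem without mastering every step
        # suggests rote memorization rather than genuine reasoning.
        return "CM" if steps_ok else "RM"
    # Failing the composite despite solving every step points to a
    # generalization gap; failing a step points to missing knowledge.
    return "IG" if steps_ok else "IK"

if __name__ == "__main__":
    print(classify(TwoStepResult(True, False, True)))   # -> "RM"
    print(classify(TwoStepResult(False, True, True)))   # -> "IG"
```

Aggregating these per-problem labels across the benchmark would then yield the hierarchical, step-aware assessment the abstract describes, rather than a single end-to-end accuracy number.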
Anthology ID:
2025.acl-long.983
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
20023–20070
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.983/
Cite (ACL):
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, YiFan Zhang, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Xiao Zong, Yida Xu, Peiqing Yang, Zhimin Bao, Muxi Diao, Chen Li, and Honggang Zhang. 2025. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20023–20070, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? (Qiao et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.983.pdf