GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Jianghangfan Zhang; Yibo Yan; Kening Zheng; Xin Zou; Song Dai; Xuming Hu

Abstract

Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the **Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator**. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset.
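The abstract describes Refined-BoN only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes the policy can continue from a prefix of steps and that the judge returns the index of the first erroneous step, a corrected step, and a solution-level score. The names `StepReview`, `refined_best_of_n`, `toy_policy`, and `toy_judge`, and the parameters `n` and `max_refinements`, are all hypothetical; the toy policy and judge are arithmetic stand-ins for an MLLM and GM-PRM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepReview:
    """Hypothetical judge output (not the paper's actual schema)."""
    first_error_index: Optional[int]  # None when no step is judged wrong
    corrected_step: Optional[str]     # replacement for the first wrong step
    score: float                      # solution-level score used for selection


# Assumed interfaces: both receive the question and a list of step strings.
# The policy returns a full solution that begins with the given prefix.
Policy = Callable[[str, List[str]], List[str]]
Judge = Callable[[str, List[str]], StepReview]


def refined_best_of_n(question: str, policy: Policy, judge: Judge,
                      n: int = 4, max_refinements: int = 1) -> List[str]:
    """Sample n solutions, refine each from its first error, return the best."""
    pool: List[Tuple[float, List[str]]] = []
    for _ in range(n):
        steps = policy(question, [])
        review = judge(question, steps)
        pool.append((review.score, steps))
        for _ in range(max_refinements):
            if review.first_error_index is None or review.corrected_step is None:
                break  # nothing to correct
            # Keep the steps before the error, swap in the judge's correction,
            # and let the policy continue from that repaired prefix.
            prefix = steps[:review.first_error_index] + [review.corrected_step]
            steps = policy(question, prefix)
            review = judge(question, steps)
            pool.append((review.score, steps))
    # Original and refined candidates compete in the same pool.
    return max(pool, key=lambda item: item[0])[1]


def toy_policy(question: str, prefix: List[str]) -> List[str]:
    """Stand-in policy: always drafts the same solution with a wrong last step."""
    draft = ["3 * 4 = 12", "2 + 12 = 15"]
    return list(prefix) + draft[len(prefix):]


def toy_judge(question: str, steps: List[str]) -> StepReview:
    """Stand-in judge: checks each 'a op b = c' step by recomputing it."""
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    first_bad, fix, good = None, None, 0
    for i, step in enumerate(steps):
        lhs, rhs = step.split(" = ")
        a, op, b = lhs.split()
        value = ops[op](int(a), int(b))
        if value == int(rhs):
            good += 1
        elif first_bad is None:
            first_bad, fix = i, f"{lhs} = {value}"
    return StepReview(first_bad, fix, good / len(steps))


best = refined_best_of_n("What is 2 + 3 * 4?", toy_policy, toy_judge, n=2)
print(best)
```

In this sketch, refined candidates are added to the pool rather than replacing the originals, so selection still falls back to an unrefined sample if a correction leads the policy astray. Whether the paper pools candidates this way is an assumption.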

Anthology ID: 2026.alvr-main.11
Volume: Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues: ALVR | WS
Publisher: Association for Computational Linguistics
Pages: 139–154
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.11/
Cite (ACL): Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, and Xuming Hu. 2026. GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 139–154, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning (Zhang et al., ALVR 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.11.pdf
