Congmin Zheng
2026
A Comprehensive Survey of Process Reward Models: Data Generation, Model Construction, and Usage
Congmin Zheng | Jiachen Zhu | Zhuoying Ou | Yuxiang Chen | Kangning Zhang | Rong Shan | Zeyu Zheng | Mengyue Yang | Jianghao Lin | Yong Yu | Weinan Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have advanced reasoning abilities, yet conventional alignment remains dominated by outcome reward models (ORMs), which judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs across the full loop: how to generate process data, how to build PRMs, and how to use PRMs for test-time scaling and reinforcement learning. We summarize applications in math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify the design space, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
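The abstract contrasts ORMs, which judge only the final answer, with PRMs, which score each reasoning step. As an illustration of one common test-time use, the sketch below reranks N candidate solutions (best-of-N) by aggregating per-step PRM scores into a single trajectory score. It assumes a PRM has already assigned each step a score in [0, 1]; the function names `prm_score` and `best_of_n` and the three aggregation rules (minimum, product, last step) are hypothetical choices for this sketch, not taken from the survey.

```python
from typing import Sequence


def prm_score(step_scores: Sequence[float], agg: str = "min") -> float:
    """Collapse hypothetical per-step PRM scores into one trajectory score."""
    if not step_scores:
        raise ValueError("trajectory has no steps")
    if agg == "min":  # a trajectory is only as good as its weakest step
        return min(step_scores)
    if agg == "prod":  # treat step scores as independent correctness probabilities
        result = 1.0
        for s in step_scores:
            result *= s
        return result
    if agg == "last":  # score only the final step
        return step_scores[-1]
    raise ValueError(f"unknown aggregation: {agg}")


def best_of_n(candidates: Sequence[Sequence[float]], agg: str = "min") -> int:
    """Return the index of the candidate with the highest aggregated score."""
    return max(range(len(candidates)), key=lambda i: prm_score(candidates[i], agg))


# Three hypothetical candidates, each given as its list of per-step scores.
candidates = [[0.9, 0.2, 0.95], [0.7, 0.8, 0.75], [0.99, 0.6]]
for rule in ("min", "prod", "last"):
    print(rule, best_of_n(candidates, rule))
```

With these made-up scores, `min` picks candidate 1, `prod` picks candidate 2, and `last` picks candidate 0, which shows that the aggregation rule is itself a design choice when PRM scores are used for selection.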
2025
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning
Jiachen Zhu | Congmin Zheng | Jianghao Lin | Kounianhua Du | Ying Wen | Yong Yu | Jun Wang | Weinan Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) have significantly advanced mathematical reasoning, and Process Reward Models (PRMs) have been developed to evaluate the logical validity of individual reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) inputs. This paper identifies two OOD issues: step OOD, which arises from differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shifts between training data and real-world problems. To address both, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM). Using a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps and provides them to the PRM as a warmup that helps it judge the target steps, improving generalization and reasoning consistency across different models and problem types. Extensive experiments show that RetrievalPRM outperforms existing baselines on multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
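To make the two-stage retrieval idea concrete, here is a minimal sketch under several assumptions. First, a bag-of-words cosine similarity stands in for the semantic encoder a real system would use. Second, the reference corpus maps each question to hypothetical labeled steps. Third, step retrieval searches the whole step pool. Fourth, the retrieved items are simply placed in a text prompt ahead of the target step. The names `embed`, `top_k`, and `build_prm_prompt` are illustrative; this is not the RetrievalPRM implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a sentence encoder: lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)  # missing keys in a Counter count as 0
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: str, pool: list, k: int, text_of=lambda item: item) -> list:
    """Return the k items in pool whose text is most similar to query."""
    q = embed(query)
    return sorted(pool, key=lambda item: cosine(q, embed(text_of(item))), reverse=True)[:k]


def build_prm_prompt(question: str, step: str, corpus: dict, k_q: int = 2, k_s: int = 2):
    """corpus (hypothetical format): question -> list of (step_text, is_correct)."""
    # Stage 1: question-level retrieval.
    similar_questions = top_k(question, list(corpus), k_q)
    # Stage 2: step-level retrieval over every labeled step in the corpus.
    step_pool = [s for steps in corpus.values() for s in steps]
    similar_steps = top_k(step, step_pool, k_s, text_of=lambda s: s[0])
    lines = ["Reference questions:"] + [f"- {q}" for q in similar_questions]
    lines += ["Reference steps:"] + [
        f"- {text} ({'correct' if ok else 'incorrect'})" for text, ok in similar_steps
    ]
    lines += [f"Question: {question}", f"Step to judge: {step}", "Is this step correct?"]
    return similar_questions, similar_steps, "\n".join(lines)
```

A PRM would then score the target step conditioned on this prompt. Keeping the two stages separate lets question-level retrieval target question OOD and step-level retrieval target step OOD, mirroring the split described in the abstract.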