PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

Mingyang Song; Zhaochen Su; Xiaoye Qu; Jiawei Zhou; Yu Cheng

Abstract

Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs’ performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 25 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research, establishing PRMBench as a robust testbed for advancing research on PRM evaluation and development.

Anthology ID: 2025.acl-long.1230
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 25299–25346
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1230/
Cite (ACL): Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. 2025. PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25299–25346, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models (Song et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1230.pdf
