Better Process Supervision with Bi-directional Rewarding Signals

Wenxiang Chen; Wei He; Zhiheng Xi; Honglin Guo; Boyang Hong; Jiazheng Zhang; Nijun Li; Tao Gui; Yun Li; Qi Zhang; Xuan-Jing Huang (黄萱菁)

Abstract

Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which suggests that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost of reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. In search-based strategies, BiRM also provides more comprehensive guidance, outperforming ORM by 5.0% and PRM by 3.8% on MATH-500.
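The A*-style idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `Candidate`, `backward_score`, `forward_score`, and `beta`, the additive combination, and the numeric scores are all hypothetical. The sketch only shows how a signal about the steps taken so far and an estimate of future success might be combined to rerank candidates in Best-of-N sampling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    # Hypothetical score for the correctness of the steps so far, in [0, 1].
    backward_score: float
    # Hypothetical estimate of the probability of eventually succeeding, in [0, 1].
    forward_score: float


def combined_score(c: Candidate, beta: float = 1.0) -> float:
    """A*-like score: signal on the path so far plus a weighted look-ahead estimate.

    The additive form and the weight `beta` are assumptions for this sketch.
    """
    return c.backward_score + beta * c.forward_score


def best_of_n(candidates: List[Candidate], beta: float = 1.0) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: combined_score(c, beta))


# Made-up scores, purely illustrative.
candidates = [
    Candidate("solution A", backward_score=0.9, forward_score=0.2),
    Candidate("solution B", backward_score=0.7, forward_score=0.8),
]
best = best_of_n(candidates)
```

With `beta=0.0` the look-ahead term is ignored and the ranking depends only on the steps so far, which corresponds to the one-directional signal the abstract attributes to existing PRMs.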

Anthology ID: 2025.findings-acl.747
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14471–14485
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.747/
Cite (ACL): Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Nijun Li, Tao Gui, Yun Li, Qi Zhang, and Xuanjing Huang. 2025. Better Process Supervision with Bi-directional Rewarding Signals. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14471–14485, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Better Process Supervision with Bi-directional Rewarding Signals (Chen et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.747.pdf
