Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA

Chi-Min Chan, Chunpu Xu, Junqi Zhu, Jiaming Ji, Donghai Hong, Pengcheng Wen, Chunyang Jiang, Zhen Ye, Yaodong Yang, Wei Xue, Sirui Han, Yike Guo


Abstract
The recent introduction of OpenAI’s O1/O3 models represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). By allocating more computational budget at test time, LLMs can explore more accurate and higher-quality solutions. However, this paradigm has primarily been verified in domains with well-defined response criteria, such as coding and mathematics. Inspired by its success, we aim to bring it to the more nuanced setting of open-domain question answering. Specifically, we employ search mechanisms such as Monte Carlo Tree Search (MCTS) to improve both the policy model and the reward model, yielding better performance under test-time scaling strategies. Our contributions are twofold. For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets while using fewer data points, offering a more data-efficient solution for training robust models. For the inference phase, we use the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory, introduces no additional overhead during training data collection, and further enhances performance when scaling test-time computation. Experimental results show that our method effectively improves the performance of both the policy model and the reward model.
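For readers unfamiliar with the search mechanism the abstract refers to, the following is a minimal, illustrative sketch of a generic MCTS loop (selection, expansion, evaluation, backpropagation) over partial answers, where a process-reward-style scorer supplies leaf values. The names propose_steps, reward, and ucb are hypothetical placeholders introduced here for illustration; this is not the paper's actual implementation or hyperparameter choices.

import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    state: str                      # partial answer text so far
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0              # accumulated reward


def ucb(node: Node, c: float = 1.4) -> float:
    # Upper confidence bound balancing exploitation and exploration.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )


def mcts(root: Node, propose_steps, reward, n_sims: int = 100) -> Node:
    # propose_steps(state) -> list of candidate next states (e.g., sampled continuations)
    # reward(state) -> float in [0, 1] (e.g., a learned process reward model score)
    for _ in range(n_sims):
        # 1. Selection: descend to a leaf by maximizing UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add candidate continuations of the partial answer.
        for state in propose_steps(node.state):
            node.children.append(Node(state=state, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # 3. Evaluation: score the leaf with the reward model.
        r = reward(leaf.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += r
            leaf = leaf.parent
    # Return the most-visited child as the preferred next step.
    return max(root.children, key=lambda n: n.visits)

In a pipeline of the kind the abstract describes, the per-node values gathered during such a search could double as supervision signals for a process reward model, which is the kind of reuse PRM+ is said to exploit.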
Anthology ID:
2025.findings-acl.388
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7433–7451
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.388/
Cite (ACL):
Chi-Min Chan, Chunpu Xu, Junqi Zhu, Jiaming Ji, Donghai Hong, Pengcheng Wen, Chunyang Jiang, Zhen Ye, Yaodong Yang, Wei Xue, Sirui Han, and Yike Guo. 2025. Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA. In Findings of the Association for Computational Linguistics: ACL 2025, pages 7433–7451, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA (Chan et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.388.pdf