Chunyang Jiang
HKUST
2025
Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA
Chi-Min Chan | Chunpu Xu | Junqi Zhu | Jiaming Ji | Donghai Hong | Pengcheng Wen | Chunyang Jiang | Zhen Ye | Yaodong Yang | Wei Xue | Sirui Han | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2025
The recent introduction of OpenAI's O1/O3 models represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). By allocating more computational budget at test time, LLMs can explore more accurate and higher-quality solutions. However, this paradigm has primarily been verified in domains with well-defined response criteria, such as coding and mathematics. Inspired by its success, we aim to bridge it to the more nuanced setting of open-domain question answering. Specifically, we utilize search mechanisms such as Monte Carlo Tree Search (MCTS) to improve both the policy model and the reward model, achieving better performance under test-time scaling strategies. Our contributions are twofold. For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets while using fewer data points, offering a more data-efficient way to train robust models. For the inference phase, we use the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory, introduces no additional overhead during training data collection, and further enhances performance when scaling test-time computation. Experimental results show that our method effectively improves the performance of both the policy model and the reward model.
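As a rough illustration of the test-time scaling idea in the abstract above, the following self-contained Python sketch performs step-level best-of-N selection guided by a process reward model, a simpler search than the MCTS used in the paper. The toy_policy and toy_prm functions are placeholder stand-ins (not the paper's policy model or PRM+); only the control flow of "expand candidates, score partial trajectories, keep the best" is meant to be indicative.

    import random

    def toy_policy(question, trajectory, n):
        """Toy policy stand-in: sample n candidate next reasoning steps."""
        return [f"step:{random.randint(0, 99)}" for _ in range(n)]

    def toy_prm(question, trajectory):
        """Toy PRM stand-in: score a partial trajectory (here, just the last step's id)."""
        return int(trajectory[-1].split(":")[1])

    def select_trajectory(question, n_candidates=8, max_steps=4):
        """Greedy step-level search: at each step, expand several candidates and
        keep the one whose extended trajectory the PRM scores highest."""
        trajectory = []
        for _ in range(max_steps):
            candidates = toy_policy(question, trajectory, n_candidates)
            best = max(candidates, key=lambda c: toy_prm(question, trajectory + [c]))
            trajectory.append(best)
        return trajectory

    if __name__ == "__main__":
        print(select_trajectory("Who painted the Mona Lisa?"))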
2023
Path Spuriousness-aware Reinforcement Learning for Multi-Hop Knowledge Graph Reasoning
Chunyang Jiang | Tianchen Zhu | Haoyi Zhou | Chang Liu | Ting Deng | Chunming Hu | Jianxin Li
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Multi-hop reasoning, a prevalent approach to query answering, aims to infer new facts along reasonable paths over a knowledge graph. Reinforcement learning methods can be adopted by formulating the problem as a Markov decision process. However, a common problem with RL-based reasoning models is that the agent can be biased toward spurious paths that coincidentally lead to the correct answer but offer poor explanations. In this work, we take a deep dive into this phenomenon and define a metric named Path Spuriousness (PS) to quantitatively estimate to what extent a path is spurious. Guided by the definition of PS, we design a model with a new reward that considers both answer accuracy and path reasonableness. We test our method on four datasets, and experiments reveal that it considerably enhances the agent's capacity to avoid spurious paths while remaining comparable to state-of-the-art performance.
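As a rough illustration of the reward design described in the abstract above, the following self-contained Python sketch combines answer correctness with a path-reasonableness term. The path_reasonableness scorer and the weight lam are illustrative assumptions for this sketch, not the paper's actual Path Spuriousness (PS) definition or its reward formulation.

    def shaped_reward(path, predicted_entity, gold_entity, path_reasonableness, lam=0.5):
        """Return a reward in [0, 1]: answer correctness plus a weighted bonus
        for how reasonable (non-spurious) the relation path is."""
        accuracy = 1.0 if predicted_entity == gold_entity else 0.0
        # path_reasonableness is assumed to map a relation path to a score in [0, 1],
        # e.g. something like 1 - PS(path) under an estimate of path spuriousness.
        return (1.0 - lam) * accuracy + lam * path_reasonableness(path)

    if __name__ == "__main__":
        # Toy usage: a scorer that (purely for illustration) prefers shorter paths.
        toy_scorer = lambda path: 1.0 / len(path)
        r = shaped_reward(["born_in", "capital_of"], "France", "France", toy_scorer)
        print(round(r, 3))  # prints 0.75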