PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

Wenfeng Feng; Penghong Zhao; Guochao Jiang; Chuzhan Hao; Guohua Liu; Yuewei Zhang

Abstract

Grouping-based methods have emerged as a significant frontier in Reinforcement Learning (RL), yet agentic reasoning poses a fundamental challenge for them: frequent environmental interactions and multi-step tool invocation generate highly variable trajectories, rendering intra-group advantage estimation unstable. In response, practitioners resort to excessive rollouts to stabilize training, which in turn incurs prohibitive computational costs. This vicious cycle between unstable advantage estimation and inefficient sampling severely limits learning performance. We present PVPO, a stable and efficient critic-free RL framework that breaks this cycle through a pre-estimated value baseline and pre-sampled data filtering. Specifically, before training begins, PVPO performs a single round of rollouts to compute two signals: (1) Static V, a Monte Carlo estimate of the expected return that serves as a fixed baseline to stabilize advantage estimation; and (2) sample-level accuracy, a difficulty metric used to filter out trivial samples and to inject ground-truth trajectories into hard ones, thereby enhancing training efficiency. As shown in Figure 1, experiments demonstrate that PVPO outperforms other grouping-based methods on both multi-step retrieval tasks and advanced mathematical reasoning benchmarks. Notably, our 7B model trained with PVPO matches or exceeds the performance of large language models (LLMs). Moreover, PVPO achieves a 2.5× speedup in training time compared to prior methods while maintaining comparable final performance.
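
The two pre-training signals described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar per-rollout reward, and the `easy`/`hard` accuracy thresholds are all assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import mean


def static_v(presampled_rewards):
    """Monte Carlo estimate of expected return for one sample:
    the mean reward over rollouts collected before training."""
    return mean(presampled_rewards)


def advantages(rewards, baseline):
    """Advantage of each training rollout against a fixed baseline,
    rather than against the mean of the current group."""
    return [r - baseline for r in rewards]


def triage(accuracy, easy=1.0, hard=0.0):
    """Sort a sample by pre-sampled accuracy.
    The thresholds are hypothetical, not taken from the paper."""
    if accuracy >= easy:
        return "drop"                 # trivial sample: filtered out
    if accuracy <= hard:
        return "inject_ground_truth"  # hard sample: add a ground-truth trajectory
    return "keep"


# Rewards from the single pre-training round of rollouts for one sample.
baseline = static_v([1.0, 0.0, 1.0, 1.0])
# Rewards observed later, during training, for the same sample.
print(advantages([1.0, 0.0], baseline))
print(triage(0.75))
```

Because the baseline is computed once before training, it does not move with the variance of each training-time group, which is the stabilizing effect the abstract attributes to Static V.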

Anthology ID: 2026.findings-acl.182
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3729–3748
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.182/
Cite (ACL): Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Guohua Liu, and Yuewei Zhang. 2026. PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 3729–3748, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning (Feng et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.182.pdf
Checklist: 2026.findings-acl.182.checklist.pdf
