Dingwei Zhu
2026
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Jiazheng Zhang | Ziche Fu | Zhiheng Xi | Wenqing Jing | Mingxu Chai | Wei He | Guoqiang Zhang | Chenghao Fan | Chenxin An | Wenxiang Chen | Zhicheng Liu | Haojie Pan | Dingwei Zhu | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while a lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
Co-authors
- Tao Gui 3
- Xuan-Jing Huang (黄萱菁) 3
- Zhiheng Xi 3
- Jiazheng Zhang 3
- Qi Zhang 3
- Mingxu Chai 2
- Junjie Ye (叶俊杰) 2
- Guoqiang Zhang 2
- Chenxin An 1
- Wenxiang Chen 1
- Tinggang Chen 1
- Shihan Dou 1
- Chenghao Fan 1
- Ziche Fu 1
- Minghe Gao 1
- Honglin Guo 1
- Xin Guo 1
- Wei He 1
- Jixuan Huang 1
- Baodai Huang 1
- Caishuang Huang 1
- Chenhao Huang 1
- Jiaming Ji 1
- Senjie Jin 1
- Wenqing Jing 1
- Guohao Li 1
- Zhicheng Liu 1
- Jiaqi Liu 1
- Chenyu Liu 1
- Dongrui Liu 1
- Zhonghang Lu 1
- Haojie Pan 1
- Xipeng Qiu (邱锡鹏) 1
- Jiajun Sun 1
- Junzhe Wang 1
- Yuhui Wang 1
- Yuran Wang 1
- Dingwen Yang 1
- Yuming Yang 1
- Qi Zhang 1
- Zhihao Zhang 1
- Ming Zhang 1
- Yunke Zhang 1
- Enyu Zhou 1