Wenxiang Chen
2026
AgentV-RL: Scaling Reward Modeling with Agentic Verifier
Jiazheng Zhang | Ziche Fu | Zhiheng Xi | Wenqing Jing | Mingxu Chai | Wei He | Guoqiang Zhang | Chenghao Fan | Chenxin An | Wenxiang Chen | Zhicheng Liu | Haojie Pan | Dingwei Zhu | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Verifiers have been shown to enhance LLM reasoning via test-time scaling (TTS), yet they face significant challenges in complex domains. Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while the lack of external grounding makes verifiers unreliable on computation- or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.
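As an illustration only, the idea of pairing a forward check with a backward check under parallel TTS (Best-of-N) can be sketched on a toy equation. Everything here is an assumption made for the example and is not taken from the paper: the `Candidate` layout, the equation 3x + 2 = 11, the use of Python arithmetic as the "tool", and the averaging of the two checks into one score.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical step format: (operation, left operand, right operand, claimed result).
Step = Tuple[Callable[[float, float], float], float, float, float]

@dataclass
class Candidate:
    steps: List[Step]
    answer: float

def forward_check(c: Candidate) -> bool:
    """Trace premises to conclusion: recompute every step with a tool (plain arithmetic)."""
    return all(abs(op(a, b) - claimed) < 1e-9 for op, a, b, claimed in c.steps)

def backward_check(c: Candidate) -> bool:
    """Re-check the conclusion against the premise: substitute x into 3x + 2 = 11."""
    return abs(3 * c.answer + 2 - 11) < 1e-9

def score(c: Candidate) -> float:
    """Toy aggregate: the fraction of the two checks that pass (an assumption)."""
    return (forward_check(c) + backward_check(c)) / 2

def best_of_n(candidates: List[Candidate]) -> Candidate:
    """Parallel test-time scaling: keep the highest-scoring sampled solution."""
    return max(candidates, key=score)

correct = Candidate([(operator.sub, 11, 2, 9), (operator.truediv, 9, 3, 3)], 3)
bad_arithmetic = Candidate([(operator.sub, 11, 2, 9), (operator.truediv, 9, 3, 4)], 4)
# Every step is arithmetically valid, but the wrong operation was chosen,
# so only the backward check catches the error.
plausible_but_wrong = Candidate(
    [(operator.add, 11, 2, 13), (operator.truediv, 13, 3, 13 / 3)], 13 / 3
)

chosen = best_of_n([bad_arithmetic, plausible_but_wrong, correct])
```

In this toy setting, `plausible_but_wrong` passes the forward check alone, which is the kind of false positive a single-direction verifier could miss.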
2025
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
Zhiheng Xi | Yiwen Ding | Wenxiang Chen | Boyang Hong | Honglin Guo | Junzhe Wang | Xin Guo | Dingwen Yang | Chenyang Liao | Wei He | Songyang Gao | Lu Chen | Rui Zheng | Yicheng Zou | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang | Zuxuan Wu | Yu-Gang Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have emerged as a promising foundation for building generally capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct an expanded instruction set, high-quality trajectories, and a comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take the initial step toward developing LLM-based agents that can handle diverse tasks via methods such as self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.
Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen | Wei He | Zhiheng Xi | Honglin Guo | Boyang Hong | Jiazheng Zhang | Nijun Li | Tao Gui | Yun Li | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
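For orientation, A* ranks a node n by f(n) = g(n) + h(n), where g is the cost already incurred and h is the estimated cost to reach the goal. A minimal sketch of the analogous idea for ranking partial solutions follows. The field names, the additive combination, and the weight `beta` are hypothetical choices for this illustration and are not BiRM's actual formulation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PartialSolution:
    text: str
    step_reward: float   # hypothetical: estimated correctness of the steps so far, in [0, 1]
    future_value: float  # hypothetical: estimated probability of eventual success, in [0, 1]

def bidirectional_score(c: PartialSolution, beta: float = 1.0) -> float:
    """Combine the backward-looking and forward-looking signals (assumed additive)."""
    return c.step_reward + beta * c.future_value

def select_best(cands: List[PartialSolution], beta: float = 1.0) -> PartialSolution:
    """Pick the candidate with the highest combined score."""
    return max(cands, key=lambda c: bidirectional_score(c, beta))

a = PartialSolution("clean steps, unpromising direction", step_reward=0.9, future_value=0.2)
b = PartialSolution("slightly noisier steps, promising direction", step_reward=0.7, future_value=0.8)

one_directional = select_best([a, b], beta=0.0)  # uses only the reward up to the current step
bidirectional = select_best([a, b], beta=1.0)    # also accounts for the estimated future success
```

With `beta=0.0` the ranking reduces to a purely one-directional signal and prefers `a`; adding the forward-looking term changes the choice to `b`.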
2024
ORTicket: Let One Robust BERT Ticket Transfer across Different Tasks
Yuhao Zhou | Wenxiang Chen | Rui Zheng | Zhiheng Xi | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Pretrained language models can be applied to various downstream tasks but are susceptible to subtle perturbations. Most adversarial defense methods introduce adversarial training during the fine-tuning phase to enhance empirical robustness. However, the repeated execution of adversarial training hinders training efficiency when transitioning to different tasks. In this paper, we explore the transferability of robustness within subnetworks and leverage this insight to introduce a novel adversarial defense method, ORTicket, eliminating the need for separate adversarial training across diverse downstream tasks. Specifically, (i) pruning the full model using the MLM task (the same task employed for BERT pretraining) yields a task-agnostic robust subnetwork (i.e., a winning ticket in the Lottery Ticket Hypothesis); and (ii) this subnetwork is then fine-tuned for downstream tasks. Extensive experiments demonstrate that our approach achieves robustness comparable to other defense methods while retaining the efficiency of traditional fine-tuning. This also confirms the significance of selecting the MLM task for identifying the transferable robust subnetwork. Furthermore, our method is orthogonal to other adversarial training approaches, indicating the potential for further enhancement of model robustness.
Co-authors
- Tao Gui 4
- Xuan-Jing Huang (黄萱菁) 4
- Zhiheng Xi 4
- Wei He 3
- Qi Zhang 3
- Honglin Guo 2
- Boyang Hong 2
- Jiazheng Zhang 2
- Rui Zheng 2
- Chenxin An 1
- Mingxu Chai 1
- Lu Chen 1
- Yiwen Ding 1
- Chenghao Fan 1
- Ziche Fu 1
- Songyang Gao 1
- Xin Guo 1
- Yu-Gang Jiang 1
- Wenqing Jing 1
- Nijun Li 1
- Yun Li 1
- Chenyang Liao 1
- Zhicheng Liu 1
- Haojie Pan 1
- Xipeng Qiu (邱锡鹏) 1
- Junzhe Wang 1
- Zuxuan Wu 1
- Dingwen Yang 1
- Guoqiang Zhang 1
- Qi Zhang 1
- Yuhao Zhou 1
- Dingwei Zhu 1
- Yicheng Zou 1