Boyang Hong
2025
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Xin Guo, Dingwen Yang, Chenyang Liao, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have emerged as a promising foundation for building generally capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct an expanded instruction set, high-quality trajectories, and a comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take the initial step toward developing LLM-based agents that can handle diverse tasks via methods like self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.
Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Nijun Li, Tao Gui, Yun Li, Qi Zhang, Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and for test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, whose principle is that an effective supervisory signal should simultaneously consider the cost incurred so far and the estimated cost of reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Moreover, in search-based strategies, BiRM provides more comprehensive guidance, outperforming ORM by 5.0% and PRM by 3.8% on MATH-500.
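The A*-style combination described in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `correctness_score` (a PRM-like backward signal, analogous to A*'s incurred cost g) and `future_success_prob` (a value-like forward signal, analogous to the estimated remaining cost h) are hypothetical placeholders for BiRM's learned scoring heads, and the weighted-sum combination and field names are assumptions for the sketch.

```python
# Illustrative sketch of A*-style bidirectional scoring for Best-of-N:
# rank N candidate reasoning trajectories by a combination of how good
# the steps so far are (backward / PRM-like signal) and how likely the
# trajectory is to reach a correct final answer (forward / value-like
# signal). All functions here are toy stand-ins, not BiRM itself.

def correctness_score(steps):
    # Backward signal (A*'s g): mean per-step reward of steps taken so far.
    return sum(s["reward"] for s in steps) / len(steps)

def future_success_prob(steps):
    # Forward signal (A*'s h): estimated probability of eventual success,
    # here read off the last step for simplicity.
    return steps[-1]["value"]

def bidirectional_score(steps, alpha=0.5):
    # A*-style combination f = g + h, here as a weighted sum.
    return alpha * correctness_score(steps) + (1 - alpha) * future_success_prob(steps)

def best_of_n(candidates, alpha=0.5):
    # Best-of-N selection: keep the highest-scoring trajectory.
    return max(candidates, key=lambda steps: bidirectional_score(steps, alpha))

# Two toy candidates: the first has better steps so far, the second is
# estimated far more likely to reach a correct final answer.
candidates = [
    [{"reward": 0.9, "value": 0.2}, {"reward": 0.8, "value": 0.3}],
    [{"reward": 0.7, "value": 0.9}, {"reward": 0.8, "value": 0.95}],
]
best = best_of_n(candidates)
```

A one-directional PRM would rank the first candidate higher (its steps so far score 0.85 vs. 0.75), while the combined score prefers the second (0.85 vs. 0.575) because its estimated future success dominates; this is the kind of distinction the bidirectional signal is meant to capture.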