Dingwen Yang
2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluation. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume that inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework whose task instances are grounded in real-world, end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capabilities of current agents and the demands of real-world applications.
2025
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
Zhiheng Xi | Yiwen Ding | Wenxiang Chen | Boyang Hong | Honglin Guo | Junzhe Wang | Xin Guo | Dingwen Yang | Chenyang Liao | Wei He | Songyang Gao | Lu Chen | Rui Zheng | Yicheng Zou | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang | Zuxuan Wu | Yu-Gang Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have emerged as a promising foundation for building generally capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct an expanded instruction set, high-quality trajectories, and a comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take an initial step toward developing LLM-based agents that can handle diverse tasks via methods like self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.
Co-authors
- Tao Gui 2
- Honglin Guo 2
- Xuan-Jing Huang (黄萱菁) 2
- Junzhe Wang 2
- Zhiheng Xi 2
- Wenxiang Chen 1
- Lu Chen 1
- Tinggang Chen 1
- Yiwen Ding 1
- Songyang Gao 1
- Minghe Gao 1
- Xin Guo 1
- Xin Guo 1
- Wei He 1
- Boyang Hong 1
- Jixuan Huang 1
- Baodai Huang 1
- Jiaming Ji 1
- Yu-Gang Jiang 1
- Guohao Li 1
- Chenyang Liao 1
- Jiaqi Liu 1
- Chenyu Liu 1
- Dongrui Liu 1
- Zhonghang Lu 1
- Xipeng Qiu (邱锡鹏) 1
- Jiajun Sun 1
- Zuxuan Wu 1
- Yuming Yang 1
- Junjie Ye (叶俊杰) 1
- Qi Zhang 1
- Qi Zhang 1
- Jiazheng Zhang 1
- Zhihao Zhang 1
- Qi Zhang 1
- Rui Zheng 1
- Dingwei Zhu 1
- Yicheng Zou 1
Venues
- ACL 2