AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments

Zhiheng Xi; Dingwen Yang; Jiaqi Liu; Jixuan Huang; Honglin Guo; Baodai Huang; Tinggang Chen; Qi Zhang; Zhonghang Lu; Chenyu Liu; Jiajun Sun; Jiazheng Zhang; Dingwei Zhu; Xin Guo; Junzhe Wang; Zhihao Zhang; Yuming Yang; Junjie Ye (叶俊杰); Minghe Gao; Dongrui Liu; Jiaming Ji; Guohao Li; Tao Gui; Qi Zhang; Xuan-Jing Huang (黄萱菁)

Abstract

Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume that inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework whose task instances are grounded in real-world, end-to-end work demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capabilities of current agents and the demands of real-world applications.

Anthology ID: 2026.acl-long.2058
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 44451–44479
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2058/
Cite (ACL): Zhiheng Xi, Dingwen Yang, Jiaqi Liu, Jixuan Huang, Honglin Guo, Baodai Huang, Tinggang Chen, Qi Zhang, Zhonghang Lu, Chenyu Liu, Jiajun Sun, Jiazheng Zhang, Dingwei Zhu, Xin Guo, Junzhe Wang, Zhihao Zhang, Yuming Yang, Junjie Ye, Minghe Gao, Dongrui Liu, Jiaming Ji, Guohao Li, Tao Gui, Qi Zhang, and Xuanjing Huang. 2026. AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 44451–44479, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments (Xi et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.2058.pdf
Checklist: 2026.acl-long.2058.checklist.pdf
