Guohao Li
2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
AVA: Attentive VLM Agent for Mastering StarCraft II
Weiyu Ma | Yuqian Fu | Zecheng Zhang | Bernard Ghanem | Guohao Li
Findings of the Association for Computational Linguistics: ACL 2026
We introduce AVACraft — the first multimodal benchmark environment for complex decision-making in StarCraft II, supporting both traditional Multi-Agent Reinforcement Learning (MARL) and modern Vision-Language Model (VLM) paradigms. Existing StarCraft II environments like SMAC rely on abstract state representations that deviate from human perception and lack support for emerging VLM-based decision-making. AVACraft mitigates these limitations via a unified framework, which provides RGB visual inputs, natural language observations and structured state information, enabling systematic comparisons between training-based and zero-shot decision-making methods. Our benchmark features 21 carefully designed scenarios covering micromanagement, coordination and strategic planning, with standardized evaluation protocols for both paradigms. We establish comprehensive baselines using four MARL algorithms (IQL, QMIX, QTRAN, VDN) and multiple state-of-the-art VLMs (GPT-4o, Qwen-VL, etc.). Experimental results reveal their complementary strengths: MARL methods achieve up to 27.1% win rate after 1M training steps in complex scenarios, while VLMs deliver superior zero-shot performance (75–81% win rate) and human-aligned decision processes without any training. Systematic analysis (including expert human evaluation) also identifies key trade-offs between training efficiency, performance ceilings and interpretability across the two paradigms. Our implementation is available at https://anonymous.4open.science/r/VLM-Play-StarCraft2-70C4 .
2025
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Qiushi Sun | Kanzhi Cheng | Zichen Ding | Chuanyang Jin | Yian Wang | Fangzhi Xu | Zhenyu Wu | Chengyou Jia | Liheng Chen | Zhoumianze Liu | Ben Kao | Guohao Li | Junxian He | Yu Qiao | Zhiyong Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, the development of such agents faces a critical bottleneck: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Further, these approaches exhibit significant gaps between the generated data and online environments, alongside limited data diversity. To address this issue, we introduce OS-Genesis, a novel GUI data synthesis pipeline that overcomes the challenges above. Unlike prior methods that rely on preset tasks, OS-Genesis reverse engineers the GUI trajectory construction process. Agents first perceive environments and perform step-level interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis’s cost-effectiveness and its superior data quality and diversity compared to existing synthesis methods.
Can an Individual Manipulate the Collective Decisions of Multi-Agents?
Fengyuan Liu | Rui Zhao | Shuo Chen | Guohao Li | Philip Torr | Lei Han | Jindong Gu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system’s collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.
Beyond Human Labels: A Multi-Linguistic Auto-Generated Benchmark for Evaluating Large Language Models on Resume Parsing
Zijian Ling | Han Zhang | Jiahao Cui | Zhequn Wu | Xu Sun | Guohao Li | Xiangjian He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Efficient resume parsing is critical for global hiring, yet the absence of dedicated benchmarks for evaluating large language models (LLMs) on multilingual, structure-rich resumes hinders progress. To address this, we introduce ResumeBench, the first privacy-compliant benchmark comprising 2,500 synthetic resumes spanning 50 templates, 30 career fields, and 5 languages. These resumes are generated through a human-in-the-loop pipeline that prioritizes realism, diversity, and privacy compliance, which are validated against real-world resumes. This paper evaluates 24 state-of-the-art LLMs on ResumeBench, revealing substantial variations in handling resume complexities. Specifically, top-performing models like GPT-4o exhibit challenges in cross-lingual structural alignment while smaller models show inconsistent scaling effects. Code-specialized LLMs underperform relative to generalists, while JSON outputs enhance schema compliance but fail to address semantic ambiguities. Our findings underscore the necessity for domain-specific optimization and hybrid training strategies to enhance structural and contextual reasoning in LLMs.
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu | Linyao Chen | Dai-Jie Wu | Yanjun Chen | Zecheng Zhang | Xiang Yao | Zhiqiang Xie | Yongchao Chen | Shilong Liu | Bochen Qian | Anjie Yang | Zhaoxuan Jin | Jianbo Deng | Philip Torr | Bernard Ghanem | Guohao Li
Findings of the Association for Computational Linguistics: ACL 2025
The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce CRAB, the first cross-environment agent benchmark framework, incorporating a graph-based fine-grained evaluation method and an efficient task generation method. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging CRAB, we develop CRAB Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated 6 advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
2024
Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
Wenbo Li | Guohao Li | Zhibin Lan | Xue Xu | Wanru Zhuang | Jiachen Liu | Xinyan Xiao | Jinsong Su
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese texts, but their development shows promising potential. In this paper, we propose a series of methods, aiming to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that BPE tokenization and insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage the model to focus on visual texts. Through experiments, we demonstrate that our methods can effectively empower backbone models to generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation quality.
Co-authors
- Bernard Ghanem 2
- Philip Torr 2
- Zecheng Zhang 2
- Liheng Chen 1
- Tinggang Chen 1
- Shuo Chen 1
- Linyao Chen 1
- Yanjun Chen 1
- Yongchao Chen 1
- Kanzhi Cheng 1
- Jiahao Cui 1
- Jianbo Deng 1
- Zichen Ding 1
- Yuqian Fu 1
- Minghe Gao 1
- Jindong Gu 1
- Tao Gui 1
- Honglin Guo 1
- Xin Guo 1
- Lei Han 1
- Junxian He 1
- Xiangjian He 1
- Jixuan Huang 1
- Baodai Huang 1
- Xuan-Jing Huang (黄萱菁) 1
- Jiaming Ji 1
- Chengyou Jia 1
- Chuanyang Jin 1
- Zhaoxuan Jin 1
- Ben Kao 1
- Zhibin Lan 1
- Wenbo Li 1
- Zijian Ling 1
- Zhoumianze Liu 1
- Jiaqi Liu 1
- Chenyu Liu 1
- Dongrui Liu 1
- Fengyuan Liu 1
- Shilong Liu 1
- Jiachen Liu 1
- Zhonghang Lu 1
- Weiyu Ma 1
- Bochen Qian 1
- Yu Qiao 1
- Jinsong Su 1
- Qiushi Sun 1
- Jiajun Sun 1
- Xu Sun 1
- Yian Wang 1
- Junzhe Wang 1
- Zhenyu Wu 1
- Zhiyong Wu 1
- Zhequn Wu 1
- Dai-Jie Wu 1
- Zhiheng Xi 1
- Xinyan Xiao 1
- Zhiqiang Xie (谢志强) 1
- Fangzhi Xu 1
- Tianqi Xu 1
- Xue Xu 1
- Dingwen Yang 1
- Yuming Yang 1
- Anjie Yang 1
- Xiang Yao 1
- Junjie Ye (叶俊杰) 1
- Qi Zhang 1
- Jiazheng Zhang 1
- Zhihao Zhang 1
- Qi Zhang 1
- Han Zhang 1
- Rui Zhao 1
- Dingwei Zhu 1
- Wanru Zhuang 1