Honglin Guo
2026
AgentGym2: Benchmarking Large Language Model Agents in De-Idealized Real-World Environments
Zhiheng Xi | Dingwen Yang | Jiaqi Liu | Jixuan Huang | Honglin Guo | Baodai Huang | Tinggang Chen | Qi Zhang | Zhonghang Lu | Chenyu Liu | Jiajun Sun | Jiazheng Zhang | Dingwei Zhu | Xin Guo | Junzhe Wang | Zhihao Zhang | Yuming Yang | Junjie Ye | Minghe Gao | Dongrui Liu | Jiaming Ji | Guohao Li | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language agents, i.e., LLM agents, are progressing rapidly and are increasingly deployed in production environments. This trend underscores the urgent need for rigorous and realistic evaluations. However, most existing benchmarks evaluate agents in simplified, idealized settings. They typically rely on pre-packaged tool interfaces, overlook critical steps, and assume inputs are clean and fully specified. Consequently, they understate the difficulty of real deployments, where uncertainty and noise are ubiquitous and agents must proactively explore the environment to uncover new tools. To bridge this gap, we present AgentGym2, a new evaluation framework with task instances grounded in real-world end-to-end working demands. Beyond reasoning and planning, it measures agents’ ability to execute end-to-end procedures, discover tools via exploration, compose tools for unseen tasks, and remain robust to noisy and underspecified information. Experiments on 15 proprietary and open-source models show that even SOTA systems like Gemini and GPT-5 struggle on AgentGym2, revealing a substantial gap between the capability of current agents and the demands of real-world applications.
OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
Deming Ding | Shichun Liu | Enhui Yang | Jiahang Lin | Ziying Chen | Shihan Dou | Honglin Guo | Weiyu Cheng | Pengyu Zhao | Chengjun Xiao | Qunhong Zeng | Qi Zhang | Xuanjing Huang | Qidi Xu | Tao Gui
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We will release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
Jie Yang | Honglin Guo | Li Ji | Jiazheng Zhou | Rui Zheng | Zhikai Lei | Shuo Zhang | Zhiheng Xi | Shichun Liu | Yuxin Wang | Bo Wang | Yining Zheng | Tao Gui | Xipeng Qiu
Findings of the Association for Computational Linguistics: ACL 2026
The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, and execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services, and to pass external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering.
MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
Jiahang Lin | Kai Hu | Binghai Wang | Yuhao Zhou | Zhiheng Xi | Honglin Guo | Shichun Liu | Junzhe Wang | Shihan Dou | Enyu Zhou | Hang Yan | Zhenhua Han | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce **MM-Doc-R1**, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose **Similarity-based Policy Optimization (SPO)**, addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO, which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that **MM-Doc-R1** outperforms previous baselines by **10.4%**. Furthermore, **SPO** demonstrates superior performance over **GRPO**, boosting results by **5.0%** with Qwen3-8B and **6.1%** with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
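As a rough illustration of the similarity-weighted baseline idea in the abstract above, the following Python sketch computes, for each trajectory in a group, a baseline equal to the similarity-weighted mean of the group's rewards. The function names, the precomputed similarity matrix `sim`, and the choice to include each trajectory's own reward in its baseline are assumptions made for illustration; the abstract does not say how similarity is measured or how the weights are normalized.

```python
from typing import List


def similarity_weighted_baseline(rewards: List[float],
                                 sim: List[List[float]]) -> List[float]:
    """Return one baseline per trajectory.

    rewards[j] is the scalar reward of trajectory j.
    sim[i][j] is a non-negative similarity between trajectories i and j
    (hypothetical input; any similarity measure could be plugged in).
    """
    n = len(rewards)
    baselines = []
    for i in range(n):
        total_weight = sum(sim[i])
        if total_weight == 0:
            # No similar trajectory at all: fall back to the plain group mean.
            baselines.append(sum(rewards) / n)
        else:
            weighted = sum(sim[i][j] * rewards[j] for j in range(n))
            baselines.append(weighted / total_weight)
    return baselines


def advantages(rewards: List[float], sim: List[List[float]]) -> List[float]:
    """Advantage of each trajectory: its reward minus its own baseline."""
    baselines = similarity_weighted_baseline(rewards, sim)
    return [r - b for r, b in zip(rewards, baselines)]
```

With uniform similarities every trajectory receives the plain group-mean baseline, so the weighting only matters when some trajectories are judged closer to each other than others.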
2025
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
Zhiheng Xi | Yiwen Ding | Wenxiang Chen | Boyang Hong | Honglin Guo | Junzhe Wang | Xin Guo | Dingwen Yang | Chenyang Liao | Wei He | Songyang Gao | Lu Chen | Rui Zheng | Yicheng Zou | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang | Zuxuan Wu | Yu-Gang Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have emerged as a promising foundation to build generally-capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents, and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct an expanded instruction set, high-quality trajectories, and a comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take the initial step to develop LLM-based agents that can handle diverse tasks via methods like self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.
CritiQ: Mining Data Quality Criteria from Human Preferences
Honglin Guo | Kai Lv | Qipeng Guo | Tianyi Liang | Zhiheng Xi | Demin Song | Qiuyinzhe Zhang | Yu Sun | Kai Chen | Xipeng Qiu | Tao Gui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and have greater reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.2 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
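The abstract above mentions pairwise judgments by worker agents and majority voting. A minimal Python sketch of aggregating such votes is given below; the function name `majority_vote`, the `"A"`/`"B"` labels, and the handling of ties are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(votes: List[str]) -> Optional[str]:
    """Aggregate pairwise judgments over a pair of samples (A, B).

    Each vote is "A" or "B" (hypothetical labels): the sample that one
    worker judged to be of higher quality. Returns the winning label,
    or None when the vote is tied or empty.
    """
    counts = Counter(votes)
    if counts["A"] > counts["B"]:
        return "A"
    if counts["B"] > counts["A"]:
        return "B"
    return None
```

For example, `majority_vote(["A", "B", "A"])` selects sample A, while an even split returns `None` so that the caller can decide how to treat undecided pairs.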
Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen | Wei He | Zhiheng Xi | Honglin Guo | Boyang Hong | Jiazheng Zhang | Nijun Li | Tao Gui | Yun Li | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2025
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
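To make the A*-style intuition in the abstract above concrete, the following Python sketch ranks candidate solutions in a Best-of-N setting by combining a backward-looking score (correctness of the steps so far) with a forward-looking score (estimated probability of eventual success). The additive combination, the weight `beta`, and all names here are assumptions chosen by analogy with A*'s f = g + h; the abstract does not state how BiRM combines the two signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    backward_score: float  # hypothetical: estimated correctness of steps so far
    forward_score: float   # hypothetical: estimated probability of future success


def combined_score(candidate: Candidate, beta: float = 1.0) -> float:
    """Assumed additive combination of the two directional signals."""
    return candidate.backward_score + beta * candidate.forward_score


def best_of_n(candidates: List[Candidate], beta: float = 1.0) -> Candidate:
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(c, beta))
```

Setting `beta=0` reduces the ranking to the backward-looking score alone, which makes it easy to see how much the forward-looking term changes the selection.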
2024
Code Needs Comments: Enhancing Code LLMs with Comment Augmentation
Demin Song | Honglin Guo | Yunhua Zhou | Shuhao Xing | Yudong Wang | Zifan Song | Wenwei Zhang | Qipeng Guo | Hang Yan | Xipeng Qiu | Dahua Lin
Findings of the Association for Computational Linguistics: ACL 2024
Programming skill is a crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs’ performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely-used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.
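One simple way to picture the comment density mentioned in the abstract above is as the share of a source file taken up by comments. The Python sketch below measures this for Python source using the standard `tokenize` module. The exact definition used here (comment characters divided by total characters, with docstrings not counted) and the function name `comment_density` are assumptions for illustration; the abstract does not define the metric.

```python
import io
import tokenize


def comment_density(source: str) -> float:
    """Fraction of characters in `source` that belong to `#` comments.

    Assumed definition for illustration only. Docstrings are string
    tokens, not comment tokens, so they are not counted here.
    """
    if not source:
        return 0.0
    comment_chars = 0
    # generate_tokens yields one token at a time from a readline callable.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
    return comment_chars / len(source)
```

A score like this could serve as one input to a filter, but any threshold would have to be chosen empirically; none is given in the abstract.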
2023
CoLLiE: Collaborative Training of Large Language Models in an Efficient Way
Kai Lv | Shuo Zhang | Tianle Gu | Shuhao Xing | Jiawei Hong | Keyu Chen | Xiaoran Liu | Yuqing Yang | Honglin Guo | Tengxiao Liu | Yu Sun | Qipeng Guo | Hang Yan | Xipeng Qiu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) are increasingly pivotal in a wide range of natural language processing tasks. Access to pre-trained models, courtesy of the open-source community, has made it possible to adapt these models to specific applications for enhanced performance. However, the substantial resources required for training these models necessitate efficient solutions. This paper introduces CoLLiE, an efficient library that facilitates collaborative training of large language models using 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, and LOMO. With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization. CoLLiE has demonstrated superior training efficiency in comparison with prevalent solutions in pre-training and fine-tuning scenarios. Furthermore, we provide an empirical evaluation of the correlation between model size and GPU memory consumption under different optimization methods, as well as an analysis of the throughput. Lastly, we carry out a comprehensive comparison of various optimizers and PEFT methods within the instruction-tuning context. CoLLiE is available at https://github.com/OpenLMLab/collie.
Co-authors
- Tao Gui 7
- Zhiheng Xi 6
- Xuan-Jing Huang (黄萱菁) 5
- Xipeng Qiu (邱锡鹏) 5
- Qipeng Guo 3
- Shichun Liu 3
- Junzhe Wang 3
- Qi Zhang 3
- Wenxiang Chen 2
- Shihan Dou 2
- Wei He 2
- Boyang Hong 2
- Jiahang Lin 2
- Kai Lv 2
- Demin Song 2
- Yu Sun 2
- Shuhao Xing 2
- Hang Yan 2
- Dingwen Yang 2
- Qi Zhang 2
- Jiazheng Zhang 2
- Lu Chen 1
- Keyu Chen 1
- Tinggang Chen 1
- Kai Chen 1
- Ziying Chen 1
- Weiyu Cheng 1
- Yiwen Ding 1
- Deming Ding 1
- Songyang Gao 1
- Minghe Gao 1
- Tianle Gu 1
- Xin Guo 1
- Xin Guo 1
- Zhenhua Han 1
- Jiawei Hong 1
- Kai Hu 1
- Jixuan Huang 1
- Baodai Huang 1
- Jiaming Ji 1
- Li Ji 1
- Yu-Gang Jiang 1
- Zhikai Lei 1
- Guohao Li 1
- Nijun Li 1
- Yun Li 1
- Tianyi Liang 1
- Chenyang Liao 1
- Dahua Lin 1
- Xiaoran Liu 1
- Tengxiao Liu 1
- Jiaqi Liu 1
- Chenyu Liu 1
- Dongrui Liu 1
- Zhonghang Lu 1
- Zifan Song 1
- Jiajun Sun 1
- Yudong Wang 1
- Yuxin Wang 1
- Bo Wang 1
- Binghai Wang 1
- Zuxuan Wu 1
- Chengjun Xiao 1
- Qidi Xu 1
- Hang Yan 1
- Yuqing Yang 1
- Yuming Yang 1
- Enhui Yang 1
- Jie Yang 1
- Junjie Ye (叶俊杰) 1
- Qunhong Zeng 1
- Shuo Zhang 1
- Qi Zhang 1
- Zhihao Zhang 1
- Qiuyinzhe Zhang 1
- Wenwei Zhang 1
- Shuo Zhang 1
- Pengyu Zhao 1
- Rui Zheng 1
- Rui Zheng 1
- Yining Zheng 1
- Yunhua Zhou 1
- Jiazheng Zhou 1
- Yuhao Zhou 1
- Enyu Zhou 1
- Dingwei Zhu 1
- Yicheng Zou 1