Zihao Cheng
2026
Mem2Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
Zihao Cheng | Zeming Liu | Yingyu Shan | Xinyi Wang | Xiangrong Zhu | Yunpu Ma | Hongru Wang | Yuhang Guo | Wei Lin | Yunhong Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While large language model–powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose **Mem2Evolve**, which integrates two core components: **Experience Memory** and **Asset Memory**. Specifically, Mem2Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent’s capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem2Evolve achieves improvements of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework.
2025
ToolSpectrum: Towards Personalized Tool Utilization for Large Language Models
Zihao Cheng | Hongru Wang | Zeming Liu | Yuhang Guo | Yuanfang Guo | Yunhong Wang | Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2025
While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions while overlooking the critical role of context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs’ capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool selection. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool selection significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code will be released soon.
RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models
Jingjing Liu | Zeming Liu | Zihao Cheng | Mengliang He | Xiaoming Shi | Yuhang Guo | Xiangrong Zhu | Yuanfang Guo | Yunhong Wang | Haifeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time developers spend on debugging and enhance their efficiency. Significant advancements in debugging datasets have been made to promote the development of code debugging. However, these datasets primarily focus on assessing LLMs’ function-level code repair capabilities, neglecting the more complex and realistic repository-level scenarios, which leads to an incomplete understanding of the challenges LLMs face in repository-level debugging. While several repository-level datasets have been proposed, they often suffer from limited diversity of tasks, languages, and error types. To address these limitations, this paper introduces RepoDebug, a multi-task and multi-language repository-level code debugging dataset with 22 subtypes of errors that supports 8 commonly used programming languages and 3 debugging tasks. Furthermore, we conduct evaluation experiments on 10 LLMs, where Claude 3.5 Sonnet, the best-performing model, still cannot perform well in repository-level debugging.