Zhenpeng Chen
2026
EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents
Yaoqi Guo | Ying Xiao | Jie M. Zhang | Mark Harman | Yiling Lou | Yang Liu | Zhenpeng Chen
Findings of the Association for Computational Linguistics: ACL 2026
Software engineering (SE) agents powered by large language models are increasingly adopted in practice, yet they often incur substantial monetary cost. We introduce EET, an experience-driven early termination approach that reduces the cost of SE agents while preserving task performance. EET extracts structured experience from prior issue-resolution executions and leverages it to guide early termination during patch generation and selection, reducing unproductive iterations. We evaluate EET on the SWE-bench Verified benchmark across three representative SE agents. EET consistently reduces total cost by 19%–55% (32% on average), with negligible loss in resolution rate (at most 0.2%). These efficiency gains are achieved, on average, by identifying early-termination opportunities for 11% of issues and reducing API calls, input tokens, and output tokens by 21%, 30%, and 25%, respectively. We release the code, prompts, and data at https://github.com/IanWalls/EET.
2025
LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs
Kaibo Liu | Zhenpeng Chen | Yiyang Liu | Jie M. Zhang | Mark Harman | Yudong Han | Yun Ma | Yihong Dong | Ge Li | Gang Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80×, 2.65×, and 1.66× those of the state-of-the-art baselines, respectively. Code and data used are available at https://github.com/RinCloud/TrickCatcher/.
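The third stage described above, running generated inputs on both the PUT and its variants and comparing outputs, can be illustrated with a minimal sketch. This is not the authors' implementation. The buggy `put`, the two hand-written variants, the random input generator, and the rule "flag an input when all variants agree with each other but disagree with the PUT" are all assumptions made for illustration; in TrickCatcher the variants and the input generator are produced by an LLM.

```python
import random
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def put(xs: List[int]) -> int:
    """Hypothetical program under test: maximum of a non-empty list.
    Plausible but buggy: it returns 0 when every element is negative."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def variant_a(xs: List[int]) -> int:
    """Hypothetical variant (stands in for an LLM-generated one)."""
    return max(xs)


def variant_b(xs: List[int]) -> int:
    """Second hypothetical variant, written differently."""
    return sorted(xs)[-1]


def generate_inputs(n: int, seed: int = 0) -> Iterable[List[int]]:
    """Hypothetical input generator (stands in for an LLM-built one)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield [rng.randint(-5, 5) for _ in range(rng.randint(1, 4))]


def find_inconsistencies(
    program: Callable[[List[int]], Any],
    variants: Sequence[Callable[[List[int]], Any]],
    inputs: Iterable[List[int]],
) -> List[Tuple[List[int], Any, Any]]:
    """Return (input, program output, variant output) for every input where
    the variants all agree with each other but disagree with the program.
    The agreement rule is an assumption of this sketch."""
    findings = []
    for inp in inputs:
        outs = [v(list(inp)) for v in variants]
        variants_agree = all(o == outs[0] for o in outs)
        program_out = program(list(inp))
        if variants_agree and program_out != outs[0]:
            findings.append((inp, program_out, outs[0]))
    return findings


if __name__ == "__main__":
    for inp, got, expected in find_inconsistencies(
        put, [variant_a, variant_b], generate_inputs(50)
    ):
        print(f"input={inp} PUT={got} variants={expected}")
```

Each reported triple is a candidate bug-revealing test case: the input, plus the variants' shared output as the tentative expected value.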
Personality-Guided Code Generation Using Large Language Models
Yaoqi Guo | Zhenpeng Chen | Jie M. Zhang | Yang Liu | Yun Ma
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance.