Jie M. Zhang
2026
EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents
Yaoqi Guo | Ying Xiao | Jie M. Zhang | Mark Harman | Yiling Lou | Yang Liu | Zhenpeng Chen
Findings of the Association for Computational Linguistics: ACL 2026
Software engineering (SE) agents powered by large language models are increasingly adopted in practice, yet they often incur substantial monetary cost. We introduce EET, an experience-driven early termination approach that reduces the cost of SE agents while preserving task performance. EET extracts structured experience from prior issue-resolution executions and leverages it to guide early termination during patch generation and selection, reducing unproductive iterations. We evaluate EET on the SWE-bench Verified benchmark across three representative SE agents. EET consistently reduces total cost by 19%–55% (32% on average), with negligible loss in resolution rate (at most 0.2%). These efficiency gains are achieved, on average, by identifying early-termination opportunities for 11% of issues and reducing API calls, input tokens, and output tokens by 21%, 30%, and 25%, respectively. We release the code, prompts, and data at https://github.com/IanWalls/EET.
A Study of LLMs’ Preferences for Libraries and Programming Languages
Lukas Twist | Jie M. Zhang | Mark Harman | Don Syme | Joost Noppen | Helen Yannakoudakis | Detlef Nauck
Findings of the Association for Computational Linguistics: ACL 2026
Despite the rapid progress of large language models (LLMs) in code generation, existing evaluations focus on functional correctness or syntactic validity, overlooking how LLMs make critical design choices such as which library or programming language to use. To fill this gap, we perform the first empirical study of LLMs’ preferences for libraries and programming languages when generating code, covering eight diverse LLMs. We observe a strong tendency to overuse widely adopted libraries such as NumPy; in up to 45% of cases, this usage is not required and deviates from the ground-truth solutions. The LLMs we study also show a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used once. These results highlight how LLMs prioritise familiarity and popularity over suitability and task-specific optimality, underscoring the need for targeted fine-tuning, data diversification, and evaluation benchmarks that explicitly measure language and library selection fidelity.
TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
Zhihao Gong | Zeyu Sun | Dong Huang | Qingyuan Liang | Jie M. Zhang | Dan Hao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present Trace, the first benchmark to explicitly assess efficiency in LLM-translated code. Trace includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency disparities often overlooked by small-scale tests. Using Trace, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness and efficiency are often misaligned: the correctness leader Claude-Sonnet-4-Think achieves only moderate time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations suffer from notable inefficiency, mainly arising from algorithm implementation discrepancy (11.9%), language construct mismatch (66.4%), and resource management inefficiency (21.7%). 3) Inference-time prompt strategies bring only modest improvements, indicating that simple prompting alone is insufficient to improve translation efficiency. Together, our results establish execution efficiency as an essential dimension of code translation and position Trace as a principled foundation for efficiency-oriented evaluation. Our code and data are available at: https://github.com/Albert-Gong/TRACE.
2025
Personality-Guided Code Generation Using Large Language Models
Yaoqi Guo | Zhenpeng Chen | Jie M. Zhang | Yang Liu | Yun Ma
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance.
LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs
Kaibo Liu | Zhenpeng Chen | Yiyang Liu | Jie M. Zhang | Mark Harman | Yudong Han | Yun Ma | Yihong Dong | Ge Li | Gang Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80×, 2.65×, and 1.66× those of the state-of-the-art baselines, respectively. Code and data used are available at https://github.com/RinCloud/TrickCatcher/.
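The final stage described in the abstract above, running generated inputs on both the PUT and its variants and flagging output disagreements, is a form of differential testing. The following is a minimal sketch of that idea only, not the TrickCatcher implementation: the functions `put_max_abs`, `variant_a`, `variant_b`, `generate_inputs`, and `find_inconsistencies` are all hypothetical names, the variants and the input generator are hand-written stand-ins for what the paper obtains from an LLM, and treating any disagreement with the PUT as an inconsistency is an assumption.

```python
import random
from typing import Callable, List, Tuple

Program = Callable[[List[int]], int]


def put_max_abs(xs: List[int]) -> int:
    """Hypothetical program under test (PUT).

    Intended spec: return the largest absolute value in xs.
    It is "plausible": correct whenever the largest magnitude is non-negative.
    """
    return max(xs)  # bug: ignores negative numbers with a large magnitude


def variant_a(xs: List[int]) -> int:
    """Hand-written stand-in for an LLM-generated program variant."""
    return max(abs(x) for x in xs)


def variant_b(xs: List[int]) -> int:
    """Second stand-in variant, written differently from variant_a."""
    return max(max(xs), -min(xs))


def generate_inputs(n: int, seed: int = 0) -> List[List[int]]:
    """Stand-in for an LLM-built input generator: random non-empty int lists."""
    rng = random.Random(seed)
    return [
        [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]
        for _ in range(n)
    ]


def find_inconsistencies(
    put: Program, variants: List[Program], inputs: List[List[int]]
) -> List[Tuple[List[int], int, List[int]]]:
    """Return (input, PUT output, variant outputs) for each disagreement.

    Assumption: any variant output that differs from the PUT output is
    reported; the paper may use a different decision rule.
    """
    findings = []
    for xs in inputs:
        put_out = put(list(xs))
        variant_outs = [v(list(xs)) for v in variants]
        if any(out != put_out for out in variant_outs):
            findings.append((xs, put_out, variant_outs))
    return findings


if __name__ == "__main__":
    reports = find_inconsistencies(
        put_max_abs, [variant_a, variant_b], generate_inputs(50)
    )
    for xs, put_out, variant_outs in reports[:3]:
        print(f"input={xs} PUT={put_out} variants={variant_outs}")
```

A reported disagreement only signals a candidate bug: when the variants themselves are wrong, the PUT may be the correct one, so a real pipeline would still need some way to decide which output to trust.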