TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Wenhan Wang; Chenyuan Yang; Zhijie Wang; Yuheng Huang; Zhaoyang Chu; Da Song; Lingming Zhang; An Ran Chen; Lei Ma

Abstract

For programming languages, testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TestEval, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate 17 popular LLMs, including both commercial and open-source ones, on TestEval. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths.
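To make the targeted line coverage task concrete, here is a minimal Python sketch of how one might check whether a generated test input reaches a chosen line. It is an illustration only and is not taken from the TestEval implementation: the program under test (`classify`), the helpers `executed_lines` and `covers_line`, and the use of line offsets relative to the `def` line are all assumptions made for this example.

```python
import sys

# Hypothetical program under test (not taken from the benchmark).
def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"

def executed_lines(func, *args):
    """Return the line offsets (relative to the `def` line) run by func(*args)."""
    code = func.__code__
    hits = set()

    def tracer(frame, event, arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                hits.add(frame.f_lineno - code.co_firstlineno)
            return tracer
        return None

    previous = sys.gettrace()  # restore any tracer that was already active
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return hits

def covers_line(func, target_offset, *args):
    """True if calling func(*args) executes the line at target_offset."""
    return target_offset in executed_lines(func, *args)

# Assumed target: `return "zero"`, which is 4 lines below `def classify`.
TARGET_OFFSET = 4
print(covers_line(classify, TARGET_OFFSET, 0))   # input reaches the target line
print(covers_line(classify, TARGET_OFFSET, -5))  # input returns before the target
```

In this framing, a model given the program and a target line has to propose an input such as `0`. Choosing it requires reasoning about which branch conditions must hold along the way, which is the kind of program comprehension the abstract reports as still difficult for current LLMs.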

Anthology ID: 2025.findings-naacl.197
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3547–3562
URL: https://preview.aclanthology.org/landing_page/2025.findings-naacl.197/
Cite (ACL): Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, and Lei Ma. 2025. TESTEVAL: Benchmarking Large Language Models for Test Case Generation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3547–3562, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): TESTEVAL: Benchmarking Large Language Models for Test Case Generation (Wang et al., Findings 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.findings-naacl.197.pdf
