@inproceedings{yang-etal-2025-llms,
    title = "Can {LLM}s Generate High-Quality Test Cases for Algorithm Problems? {T}est{C}ase-Eval: A Systematic Evaluation of Fault Coverage and Exposure",
    author = "Yang, Zheyuan  and
      Kuang, Zexi  and
      Xia, Xue  and
      Zhao, Yilun",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2025.acl-short.82/",
    doi = "10.18653/v1/2025.acl-short.82",
    pages = "1050--1063",
    ISBN = "979-8-89176-252-7",
    abstract = "We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems."
}