Abstract
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate the code-generation capabilities of these LLMs. We conducted a large-scale human evaluation of *HumanEval* and *MBPP*, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, while most other concepts are neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts that provide a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at this [URL](https://github.com/PythonSaga/PythonSaga).
- Anthology ID:
- 2024.findings-emnlp.996
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17113–17126
- URL:
- https://aclanthology.org/2024.findings-emnlp.996/
- DOI:
- 10.18653/v1/2024.findings-emnlp.996
- Cite (ACL):
- Ankit Yadav, Himanshu Beniwal, and Mayank Singh. 2024. PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17113–17126, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs (Yadav et al., Findings 2024)
- PDF:
- https://aclanthology.org/2024.findings-emnlp.996.pdf