Abstract
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate the code-generation capabilities of these LLMs. We conducted a large-scale human evaluation of *HumanEval* and *MBPP*, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings reveal a critical bias towards a limited set of programming concepts, while most other concepts are neglected entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate estimates of model performance. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts that provide a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at this [URL](https://github.com/PythonSaga/PythonSaga).
- Anthology ID:
- 2024.findings-emnlp.996
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17113–17126
- URL:
- https://aclanthology.org/2024.findings-emnlp.996/
- DOI:
- 10.18653/v1/2024.findings-emnlp.996
- Cite (ACL):
- Ankit Yadav, Himanshu Beniwal, and Mayank Singh. 2024. PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17113–17126, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs (Yadav et al., Findings 2024)
- PDF:
- https://aclanthology.org/2024.findings-emnlp.996.pdf