CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma
Abstract
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model’s code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs’ code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark’s ability to probe deeper into models’ code understanding abilities. Our benchmark is available at https://github.com/CodeLLM-Research/CodeJudge-Eval .- Anthology ID:
- 2025.coling-main.7
- Volume:
- Proceedings of the 31st International Conference on Computational Linguistics
- Month:
- January
- Year:
- 2025
- Address:
- Abu Dhabi, UAE
- Editors:
- Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
- Venue:
- COLING
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 73–95
- Language:
- URL:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-main.7/
- DOI:
- Cite (ACL):
- Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, and Jing Ma. 2025. CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?. In Proceedings of the 31st International Conference on Computational Linguistics, pages 73–95, Abu Dhabi, UAE. Association for Computational Linguistics.
- Cite (Informal):
- CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding? (Zhao et al., COLING 2025)
- PDF:
- https://preview.aclanthology.org/jlcl-multiple-ingestion/2025.coling-main.7.pdf