mHumanEval - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri


Abstract
Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However, this and other Code LLM benchmarks face critical limitations, particularly in task diversity, test coverage, and linguistic scope. Current evaluations primarily focus on English-to-Python conversion tasks with limited test cases, potentially overestimating model performance. While recent work has addressed test coverage and programming-language diversity, code generation from low-resource natural-language prompts remains largely unexplored. To address this gap, we introduce mHumanEval, an extended benchmark supporting prompts in over 200 natural languages. We compile the benchmark using established machine translation methods combined with a quality assurance process, and additionally provide expert human translations for 15 diverse natural languages. We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs, offering insights into the current landscape of cross-lingual code generation.
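Benchmarks in the HumanEval family are scored by functional correctness, typically reported as pass@k over hidden unit tests; for a multilingual variant like mHumanEval, only the natural-language prompt changes while the tests stay the same. The sketch below is not the authors' released harness: it shows the standard unbiased pass@k estimator from the original HumanEval paper and a toy per-problem evaluation loop. The `evaluate_problem` helper, the assumption that each candidate is the full program text, and the use of bare `exec` without sandboxing or timeouts are simplifications for illustration only.

```python
import math
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled completions of which c pass
    the tests, estimate the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


def evaluate_problem(candidates: List[str], test_code: str, entry_point: str, k: int = 1) -> float:
    """Toy per-problem evaluation: run each candidate program against the
    HumanEval-style test code (which defines check(candidate)) and return pass@k.
    A real harness must sandbox untrusted code and enforce timeouts."""
    n, c = len(candidates), 0
    for candidate in candidates:
        env: dict = {}
        try:
            exec(candidate, env)              # define the candidate solution
            exec(test_code, env)              # define check(candidate)
            env["check"](env[entry_point])    # raises AssertionError on failure
            c += 1
        except Exception:
            pass  # any error (syntax, assertion, missing entry point) counts as incorrect
    return pass_at_k(n, c, k)
```

Averaging this score over all problems gives the headline pass@1 (or pass@k) number; for a multilingual benchmark the same loop is simply repeated per natural language of the prompt, which is what makes cross-lingual comparisons of the kind described in the abstract possible.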
Anthology ID: 2025.naacl-long.570
Volume: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 11432–11461
URL: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.570/
Cite (ACL):
Nishat Raihan, Antonios Anastasopoulos, and Marcos Zampieri. 2025. mHumanEval - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 11432–11461, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
mHumanEval - A Multilingual Benchmark to Evaluate Large Language Models for Code Generation (Raihan et al., NAACL 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.570.pdf