MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages

Wenhan Han, Yifan Zhang, Zhixun Chen, Binbinliu, Mykola Pechenizkiy, Meng Fang, Yin Zheng


Abstract
Multilingual large language models (LLMs) are advancing rapidly, with new models frequently claiming support for an increasing number of languages. However, existing evaluation datasets are limited and lack cross-lingual alignment, leaving assessments of multilingual capabilities fragmented in both language and skill coverage. To address this, we introduce MuBench, a benchmark covering 61 languages with 3.9M samples and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage, particularly a persistent performance disparity between English and low-resource languages. Leveraging MuBench’s alignment, we propose Multilingual Consistency (MLC) as a complementary metric to accuracy for analyzing performance bottlenecks and guiding model improvement. MuBench provides flexible evaluation formats, including mixed-language testing. Experimental results show that increasing model size does not improve a model's ability to handle mixed-language contexts. We recruited human experts to evaluate translation quality and cultural sensitivity for 34k samples across 17 languages, and combined these assessments with an LLM-as-a-Judge approach to ensure overall data quality in low-resource languages.
Anthology ID:
2026.findings-acl.794
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16163–16192
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.794/
Cite (ACL):
Wenhan Han, Yifan Zhang, Zhixun Chen, Binbinliu, Mykola Pechenizkiy, Meng Fang, and Yin Zheng. 2026. MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages. In Findings of the Association for Computational Linguistics: ACL 2026, pages 16163–16192, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages (Han et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.794.pdf
Checklist:
2026.findings-acl.794.checklist.pdf