Multi-lingual Functional Evaluation for Large Language Models

Victor Ojewale; Inioluwa Deborah Raji; Suresh Venkatasubramanian

Abstract

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM. However, these evaluations often fail to provide an adequate understanding of the practical performance and robustness of models across multi-lingual settings. In response, we create two multi-lingual functional benchmarks, Cross-Lingual Grade School Math Symbolic (CL-GSM Symbolic) and Cross-Lingual Instruction-Following Eval (CL-IFEval), by translating existing functional benchmark templates from English to five additional languages that span the range of resources available for NLP: French, Spanish, Hindi, Arabic and Yoruba. Our results show that the gap between static and functional evaluations is highly uneven: across models, performance drops from M-GSM to CL-GSM Symbolic by 24%, 17%, and 18% in English, French, and Spanish, respectively, while the drop from Belebele to CL-IFEval ranges from 15% to 24% across languages, and the drop from M-MMLU to CL-IFEval is much smaller (0.5% to 3%). Similarly, we find that model robustness varies significantly across languages, with certain languages (e.g., Arabic and English) performing most consistently well across evaluation iterations.

Anthology ID: 2026.findings-acl.1731
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34672–34691
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1731/
Cite (ACL): Victor Ojewale, Inioluwa Deborah Raji, and Suresh Venkatasubramanian. 2026. Multi-lingual Functional Evaluation for Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34672–34691, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Multi-lingual Functional Evaluation for Large Language Models (Ojewale et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1731.pdf
Checklist: 2026.findings-acl.1731.checklist.pdf
