M-IFEval: Multilingual Instruction-Following Evaluation

Antoine Dussolle, A. Cardeña, Shota Sato, Peter Devine


Abstract
Instruction following is a core capability of modern large language models (LLMs), making its evaluation essential to understanding these models. The Instruction Following Evaluation (IFEval) benchmark from the literature does this using objective criteria, offering a measure of LLM performance without subjective AI or human judgement. However, it only includes English instructions, limiting its ability to assess LLMs in other languages. We propose the Multilingual Instruction Following Evaluation (M-IFEval) benchmark, expanding the evaluation to French, Japanese, and Spanish, with both general and language-specific instructions. Applying this benchmark to 8 state-of-the-art LLMs, we find that benchmark performance across languages and instruction types can vary widely, underscoring the importance of a multilingual benchmark for evaluating LLMs in a diverse cultural context.
Anthology ID:
2025.findings-naacl.344
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6161–6176
URL:
https://preview.aclanthology.org/moar-dois/2025.findings-naacl.344/
DOI:
10.18653/v1/2025.findings-naacl.344
Cite (ACL):
Antoine Dussolle, A. Cardeña, Shota Sato, and Peter Devine. 2025. M-IFEval: Multilingual Instruction-Following Evaluation. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6161–6176, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
M-IFEval: Multilingual Instruction-Following Evaluation (Dussolle et al., Findings 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.findings-naacl.344.pdf