Test Set Quality in Multilingual LLM Evaluation

Chalamalasetti Kranti, Gabriel Bernier-Colborne, Yvan Gauthier, Sowmya Vajjala


Abstract
Several multilingual benchmark datasets have recently been developed in a semi-automatic manner to measure progress and understand the state of the art in the multilingual capabilities of Large Language Models (LLMs). However, little attention has been paid to the quality of the datasets themselves, despite previous work identifying errors even in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages – French and Telugu – identifying several errors in the datasets in the process. We compare the performance of several LLMs on the original and revised versions of the datasets and find large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with recommendations for both dataset creators and consumers on addressing dataset quality issues.
Anthology ID:
2025.eval4nlp-1.14
Volume:
Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Mousumi Akter, Tahiya Chowdhury, Steffen Eger, Christoph Leiter, Juri Opitz, Erion Çano
Venues:
Eval4NLP | WS
Publisher:
Association for Computational Linguistics
Pages:
167–178
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.eval4nlp-1.14/
Cite (ACL):
Chalamalasetti Kranti, Gabriel Bernier-Colborne, Yvan Gauthier, and Sowmya Vajjala. 2025. Test Set Quality in Multilingual LLM Evaluation. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, pages 167–178, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
Test Set Quality in Multilingual LLM Evaluation (Kranti et al., Eval4NLP 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.eval4nlp-1.14.pdf