IberoBench: A Benchmark for LLM Evaluation in Iberian Languages

Irene Baucells; Javier Aula-Blasco; Iria de-Dios-Flores; Silvia Paniagua Suárez; Naiara Pérez; Anna Salles; Susana Sotelo Docio; Júlia Falcão; José Javier Saiz; Robiert Sepúlveda-Torres; Jeremy Barnes; Pablo Gamallo; Aitor González-Agirre; German Rigau; Marta Villegas

Abstract

The current best practice for measuring the performance of base Large Language Models (LLMs) is to establish a multi-task benchmark that covers a range of capabilities of interest. However, such benchmarks are currently available only in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench in 0- and 5-shot settings. We also discuss the issues we encountered when working with the Harness and our approach to solving them to ensure high-quality evaluation.

Anthology ID: 2025.coling-main.699
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 10491–10519
URL: https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.699/
Cite (ACL): Irene Baucells, Javier Aula-Blasco, Iria de-Dios-Flores, Silvia Paniagua Suárez, Naiara Perez, Anna Salles, Susana Sotelo Docio, Júlia Falcão, Jose Javier Saiz, Robiert Sepulveda Torres, Jeremy Barnes, Pablo Gamallo, Aitor Gonzalez-Agirre, German Rigau, and Marta Villegas. 2025. IberoBench: A Benchmark for LLM Evaluation in Iberian Languages. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10491–10519, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): IberoBench: A Benchmark for LLM Evaluation in Iberian Languages (Baucells et al., COLING 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.coling-main.699.pdf