Susana Sotelo Docio
2025
IberoBench: A Benchmark for LLM Evaluation in Iberian Languages
Irene Baucells | Javier Aula-Blasco | Iria de-Dios-Flores | Silvia Paniagua Suárez | Naiara Perez | Anna Salles | Susana Sotelo Docio | Júlia Falcão | Jose Javier Saiz | Robiert Sepulveda Torres | Jeremy Barnes | Pablo Gamallo | Aitor Gonzalez-Agirre | German Rigau | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics
The current best practice to measure the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench in 0- and 5-shot settings. We also explore the issues we encounter when working with the Harness and our approach to solving them to ensure high-quality evaluation.
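Since the benchmark is built on the LM Evaluation Harness, a rough illustration of how such an evaluation could be launched through the Harness's Python API is sketched below. The model identifier and the task name "iberobench_example_task" are placeholders for illustration only, not actual IberoBench task identifiers.

    # Minimal sketch: evaluating a model with the LM Evaluation Harness (lm-eval).
    # Task and model names are placeholders, not the paper's configuration.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                         # Hugging Face backend
        model_args="pretrained=meta-llama/Llama-2-7b-hf",   # example model under evaluation
        tasks=["iberobench_example_task"],                  # placeholder task name
        num_fewshot=5,                                      # 5-shot setting; use 0 for zero-shot
    )

    # Aggregated per-task metrics are stored under the "results" key.
    print(results["results"])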
Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study
Pablo Rodríguez | Silvia Paniagua Suárez | Pablo Gamallo | Susana Sotelo Docio
Findings of the Association for Computational Linguistics: ACL 2025
Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of the Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, instruction and evaluation datasets, and an evaluation framework.
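As a rough illustration of the Continued Pretraining setup the abstract describes, the sketch below shows how CPT on a monolingual Galician corpus might look with the Hugging Face Trainer. The base model name and the corpus file are placeholders, not the models or data used in the paper.

    # Minimal sketch of continued pretraining (CPT) on a Galician text corpus
    # with the Hugging Face Trainer. Model and corpus names are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "meta-llama/Llama-2-7b-hf"        # example multilingual base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Placeholder corpus file; the paper uses small, high-quality Galician corpora.
    corpus = load_dataset("text", data_files={"train": "galician_corpus.txt"})["train"]

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=1024)

    tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="cpt-galician",
                               num_train_epochs=1,
                               per_device_train_batch_size=1,
                               learning_rate=1e-5),
        train_dataset=tokenized,
        data_collator=collator,
    )
    trainer.train()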