Irune Zubiaga
2026
Judging Instruction Responses in a Low-Resource Language: A Case Study on Basque
David Ponce | Harritxu Gete | Thierry Etchegoyhen | Irune Zubiaga | Aitor Soroa
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Evaluating the quality of answers to a given instruction is a demanding and time-consuming task, limiting the scalability of human assessment. Large language models (LLMs) have been proposed as automatic judges to reduce this effort, but their reliability in low-resource contexts remains uncertain. Additionally, the premise that humans are reliable judges of fine-grained response quality needs to be assessed as well, if correlation with automated judges on this task is to be considered a gold standard. In this work, we investigate the performance of various LLM-as-a-judge models in a low-resource scenario, namely Basque, and evaluate their correlation with human judgements. Additionally, we measure the agreement between human judgements themselves, to assess their viability as a valid reference. To perform our experiments, we translated and manually post-edited the Just-Eval benchmark, a suite of benchmarks tackling fine-grained aspects of response quality. We also extend the evaluation with a novel category aimed at judging both language consistency and grammaticality. Our results show that state-of-the-art models exhibit fairly poor correlations with humans and amongst themselves, calling for the development of dedicated LLM-as-a-judge models for this language.
2025
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
María Grandury | Javier Aula-Blasco | Júlia Falcão | Clémentine Fourrier | Miguel González Saiz | Gonzalo Martínez | Gonzalo Santamaria Gomez | Rodrigo Agerri | Nuria Aldama García | Luis Chiruzzo | Javier Conde | Helena Gomez Adorno | Marta Guerrero Nieto | Guido Ivetta | Natàlia López Fuertes | Flor Miriam Plaza-del-Arco | María-Teresa Martín-Valdivia | Helena Montoro Zamorano | Carmen Muñoz Sanz | Pedro Reviriego | Leire Rosado Plaza | Alejandro Vaca Serrano | Estrella Vallecillo-Rodríguez | Jorge Vallego | Irune Zubiaga
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Catalan, Basque, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Raez | Aitor Soroa | María-Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation
This paper presents an overview of the Shared Task organized in the First Workshop on Multilingual Counterspeech Generation at COLING 2025. While interest in automatic approaches to Counterspeech generation has been steadily growing, the large majority of the published experimental work has been carried out for English. This is due both to the scarcity of manually curated non-English training data and to the crushing predominance of English in the generative Large Language Models (LLMs) ecosystem. The task’s goal is to promote and encourage research on Counterspeech generation in a multilingual setting (Basque, English, Italian, and Spanish), potentially leveraging background knowledge provided in the proposed dataset. The task attracted 11 participants, 9 of whom presented a paper describing their systems. Together with the task, we introduce a new multilingual counterspeech dataset with 2384 triplets of hate speech, counterspeech, and related background knowledge covering 4 languages. The dataset is available at: https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN.
Proceedings of the First Workshop on Multilingual Counterspeech Generation
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Ráez | Aitor Soroa | María Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation
2024
A LLM-based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
Irune Zubiaga | Aitor Soroa | Rodrigo Agerri
Findings of the Association for Computational Linguistics: EMNLP 2024
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a ρ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in a zero-shot setting are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
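The tournament-style ranking pipeline described above can be sketched as a round-robin over model pairs whose outcomes are tallied into a ranking. This is a minimal illustration, not the paper's actual implementation: the `toy_judge` and its hidden quality scores are hypothetical stand-ins for an LLM evaluator comparing two generated counter-narratives:

```python
from collections import defaultdict
from itertools import combinations

def rank_models(models, judge):
    """Round-robin tournament: each model pair is judged twice, once per
    position order (to mitigate position bias), wins are tallied, and
    models are ranked by total win count."""
    wins = defaultdict(int)
    for a, b in combinations(models, 2):
        for first, second in ((a, b), (b, a)):
            wins[judge(first, second)] += 1
    return sorted(models, key=lambda m: wins[m], reverse=True)

# Toy judge standing in for an LLM evaluator: it prefers the model with
# the higher hidden quality score (illustrative values only).
quality = {"chat": 3, "instruct": 2, "base": 1}
toy_judge = lambda a, b: a if quality[a] >= quality[b] else b

print(rank_models(["base", "instruct", "chat"], toy_judge))
# -> ['chat', 'instruct', 'base']
```

The resulting ranking can then be compared against a human preference ranking with a rank correlation such as Spearman's ρ, which is how the 0.88 figure in the abstract is framed.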
Co-authors
- Rodrigo Agerri 4
- Aitor Soroa 4
- Helena Bonaldi 2
- Marco Guerini 2
- Arturo Montejo-Ráez 2
- María Estrella Vallecillo-Rodríguez 2
- Javier Aula-Blasco 1
- Luis Chiruzzo 1
- Javier Conde 1
- Thierry Etchegoyhen 1
- Júlia Falcão 1
- Clémentine Fourrier 1
- Natàlia López Fuertes 1
- Nuria Aldama García 1
- Harritxu Gete 1
- Gonzalo Santamaria Gomez 1
- Helena Gomez Adorno 1
- María Grandury 1
- Guido Ivetta 1
- María-Teresa Martín-Valdivia 1
- M. Teresa Martín-Valdivia 1
- Gonzalo Martínez 1
- Marta Guerrero Nieto 1
- Leire Rosado Plaza 1
- Flor Miriam Plaza-del-Arco 1
- David Ponce 1
- Pedro Reviriego 1
- Miguel González Saiz 1
- Carmen Muñoz Sanz 1
- Alejandro Vaca Serrano 1
- Estrella Vallecillo-Rodríguez 1
- Jorge Vallego 1
- Helena Montoro Zamorano 1