Irune Zubiaga
2026
Judging Instruction Responses in a Low-Resource Language: A Case Study on Basque
David Ponce | Harritxu Gete | Thierry Etchegoyhen | Irune Zubiaga | Aitor Soroa
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Evaluating the quality of answers to a given instruction is a demanding and time-consuming task, limiting the scalability of human assessment. Large language models (LLMs) have been proposed as automatic judges to reduce this effort, but their reliability in low-resource contexts remains uncertain. Additionally, the premise that humans are reliable judges of fine-grained response quality needs to be assessed as well, if correlation with automated judges on this task is to be considered a gold standard. In this work, we investigate the performance of various LLM-as-a-judge models in a low-resource scenario, namely Basque, and evaluate their correlation with human judgements. Additionally, we measure the agreement between human judgements themselves, to assess their viability as a valid reference. To perform our experiments, we translated and manually post-edited the Just-Eval benchmark, a suite of benchmarks tackling fine-grained aspects of response quality. We also extend the evaluation with a novel category aimed at judging both language consistency and grammaticality. Our results show that state-of-the-art models exhibit fairly poor correlations with humans and amongst themselves, calling for the development of dedicated LLM-as-a-judge models for this language.
2025
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
María Grandury | Javier Aula-Blasco | Júlia Falcão | Clémentine Fourrier | Miguel González Saiz | Gonzalo Martínez | Gonzalo Santamaria Gomez | Rodrigo Agerri | Nuria Aldama García | Luis Chiruzzo | Javier Conde | Helena Gomez Adorno | Marta Guerrero Nieto | Guido Ivetta | Natàlia López Fuertes | Flor Miriam Plaza-del-Arco | María-Teresa Martín-Valdivia | Helena Montoro Zamorano | Carmen Muñoz Sanz | Pedro Reviriego | Leire Rosado Plaza | Alejandro Vaca Serrano | Estrella Vallecillo-Rodríguez | Jorge Vallego | Irune Zubiaga
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Catalan, Basque, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Raez | Aitor Soroa | María-Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation
This paper presents an overview of the Shared Task organized in the First Workshop on Multilingual Counterspeech Generation at COLING 2025. While interest in automatic approaches to Counterspeech generation has been steadily growing, the large majority of the published experimental work has been carried out for English. This is due both to the scarcity of manually curated non-English training data and to the crushing predominance of English in the generative Large Language Models (LLMs) ecosystem. The task’s goal is to promote and encourage research on Counterspeech generation in a multilingual setting (Basque, English, Italian, and Spanish), potentially leveraging background knowledge provided in the proposed dataset. The task attracted 11 participants, 9 of whom presented a paper describing their systems. Together with the task, we introduce a new multilingual counterspeech dataset with 2384 triplets of hate speech, counterspeech, and related background knowledge covering 4 languages. The dataset is available at: https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN.
Proceedings of the First Workshop on Multilingual Counterspeech Generation
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Ráez | Aitor Soroa | María Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation
2024
A LLM-based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
Irune Zubiaga | Aitor Soroa | Rodrigo Agerri
Findings of the Association for Computational Linguistics: EMNLP 2024
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a ρ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in a zero-shot setting are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
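The tournament-style ranking pipeline described above can be sketched as a round-robin over model pairs whose outcomes are tallied into a ranking. This is a minimal illustration, not the paper's actual implementation: the `toy_judge` and its hidden quality scores are hypothetical stand-ins for an LLM evaluator comparing two generated counter-narratives:

```python
from collections import defaultdict
from itertools import combinations

def rank_models(models, judge):
    """Round-robin tournament: each model pair is judged twice, once per
    position order (to mitigate position bias), wins are tallied, and
    models are ranked by total win count."""
    wins = defaultdict(int)
    for a, b in combinations(models, 2):
        for first, second in ((a, b), (b, a)):
            wins[judge(first, second)] += 1
    return sorted(models, key=lambda m: wins[m], reverse=True)

# Toy judge standing in for an LLM evaluator: it prefers the model with
# the higher hidden quality score (illustrative values only).
quality = {"chat": 3, "instruct": 2, "base": 1}
toy_judge = lambda a, b: a if quality[a] >= quality[b] else b

print(rank_models(["base", "instruct", "chat"], toy_judge))
# -> ['chat', 'instruct', 'base']
```

The resulting ranking can then be compared against a human preference ranking with a rank correlation such as Spearman's ρ, which is how the 0.88 figure in the abstract is framed.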
Co-authors
- Rodrigo Agerri 4
- Aitor Soroa 4
- Helena Bonaldi 2
- Marco Guerini 2
- Arturo Montejo-Ráez 2
- María Estrella Vallecillo-Rodríguez 2
- Javier Aula-Blasco 1
- Luis Chiruzzo 1
- Javier Conde 1
- Thierry Etchegoyhen 1
- Júlia Falcão 1
- Clémentine Fourrier 1
- Natàlia López Fuertes 1
- Nuria Aldama García 1
- Harritxu Gete 1
- Gonzalo Santamaria Gomez 1
- Helena Gomez Adorno 1
- María Grandury 1
- Guido Ivetta 1
- María-Teresa Martín-Valdivia 1
- M. Teresa Martín-Valdivia 1
- Gonzalo Martínez 1
- Marta Guerrero Nieto 1
- Leire Rosado Plaza 1
- Flor Miriam Plaza-del-Arco 1
- David Ponce 1
- Pedro Reviriego 1
- Miguel González Saiz 1
- Carmen Muñoz Sanz 1
- Alejandro Vaca Serrano 1
- Estrella Vallecillo-Rodríguez 1
- Jorge Vallego 1
- Helena Montoro Zamorano 1