2025
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
María Grandury | Javier Aula-Blasco | Júlia Falcão | Clémentine Fourrier | Miguel González Saiz | Gonzalo Martínez | Gonzalo Santamaria Gomez | Rodrigo Agerri | Nuria Aldama García | Luis Chiruzzo | Javier Conde | Helena Gomez Adorno | Marta Guerrero Nieto | Guido Ivetta | Natàlia López Fuertes | Flor Miriam Plaza-del-Arco | María-Teresa Martín-Valdivia | Helena Montoro Zamorano | Carmen Muñoz Sanz | Pedro Reviriego | Leire Rosado Plaza | Alejandro Vaca Serrano | Estrella Vallecillo-Rodríguez | Jorge Vallego | Irune Zubiaga
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Catalan, Basque, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
Proceedings of the First Workshop on Multilingual Counterspeech Generation
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Ráez | Aitor Soroa | María Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation
The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Raez | Aitor Soroa | María-Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation
This paper presents an overview of the Shared Task organized at the First Workshop on Multilingual Counterspeech Generation at COLING 2025. While interest in automatic approaches to Counterspeech generation has been steadily growing, the large majority of the published experimental work has been carried out for English. This is due both to the scarcity of manually curated non-English training data and to the crushing predominance of English in the generative Large Language Model (LLM) ecosystem. The task’s goal is to promote and encourage research on Counterspeech generation in a multilingual setting (Basque, English, Italian, and Spanish), potentially leveraging the background knowledge provided in the proposed dataset. The task attracted 11 participants, 9 of whom presented a paper describing their systems. Together with the task, we introduce a new multilingual counterspeech dataset with 2384 triplets of hate speech, counterspeech, and related background knowledge covering the 4 languages. The dataset is available at: https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN.
2024
A LLM-based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
Irune Zubiaga | Aitor Soroa | Rodrigo Agerri
Findings of the Association for Computational Linguistics: EMNLP 2024
This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a ρ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in zero-shot are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
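As a rough illustration of the tournament-style ranking described in this abstract, the sketch below (not the authors' code; the model names, judge outcomes, and human ranking are hypothetical) aggregates pairwise LLM-judge wins into a model ranking and measures its agreement with a human ranking via Spearman's ρ.

```python
# Illustrative sketch only: aggregate pairwise LLM-judge outcomes into a
# tournament-style ranking and compare it against a human ranking.
# Model names, judge outcomes, and the human ranking below are hypothetical.
from collections import Counter
from scipy.stats import spearmanr

# Each judgment records which of two models' counter-narratives the judge preferred.
judgments = [
    ("chat-7b", "base-7b", "chat-7b"),
    ("chat-7b", "instruct-7b", "chat-7b"),
    ("instruct-7b", "base-7b", "instruct-7b"),
]

models = sorted({m for a, b, _ in judgments for m in (a, b)})

# Tournament-style aggregation: one point per pairwise win.
wins = Counter(winner for _, _, winner in judgments)
llm_ranking = sorted(models, key=lambda m: wins[m], reverse=True)

# Hypothetical human-preference ranking over the same models.
human_ranking = ["chat-7b", "instruct-7b", "base-7b"]

# Spearman's rho between the two rankings (rank positions per model).
llm_ranks = [llm_ranking.index(m) for m in models]
human_ranks = [human_ranking.index(m) for m in models]
rho, _ = spearmanr(llm_ranks, human_ranks)
print(f"Spearman rho (LLM judge vs. human): {rho:.2f}")
```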