Irene Baucells


2025

IberoBench: A Benchmark for LLM Evaluation in Iberian Languages
Irene Baucells | Javier Aula-Blasco | Iria de-Dios-Flores | Silvia Paniagua Suárez | Naiara Perez | Anna Salles | Susana Sotelo Docio | Júlia Falcão | Jose Javier Saiz | Robiert Sepulveda Torres | Jeremy Barnes | Pablo Gamallo | Aitor Gonzalez-Agirre | German Rigau | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics

The current best practice to measure the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench in 0- and 5-shot settings. We also explore the issues we encountered when working with the Harness and our approach to solving them to ensure high-quality evaluation.

Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages?
Luca Moroni | Javier Aula-Blasco | Simone Conia | Irene Baucells | Naiara Perez | Silvia Paniagua Suárez | Anna Sallés | Malte Ostendorff | Júlia Falcão | Guijin Son | Aitor Gonzalez-Agirre | Roberto Navigli | Marta Villegas
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

As large language models (LLMs) continue to improve, their evaluation increasingly centers on complex, high-level tasks, often at the expense of systematically assessing fundamental capabilities. To address this gap, recent work proposed LMentry, a compact benchmark comprising tasks that are trivial for humans but remain surprisingly difficult for LLMs. However, LMentry is limited to English, leaving its insights linguistically narrow. In this paper, we present Multi-LMentry, a ground-up recreation of LMentry that enables systematic evaluation of LLMs on basic reasoning and understanding tasks across nine diverse languages. Multi-LMentry includes English and expands to Basque, Brazilian Portuguese, Catalan, Galician, German, Italian, Korean, and Spanish, emphasizing the importance of cross-lingual and low-resource settings. To validate that Multi-LMentry is still trivial for humans, we demonstrate that L2 speakers with only elementary proficiency achieve near-perfect scores in a low-resource language, namely, Basque. Through extensive experiments, we reveal that state-of-the-art open-weight multilingual LLMs still fall short of human performance on elementary tasks in many languages. Our results expose new failure modes that remain hidden in monolingual evaluation, underscoring the need for rigorous, language-diverse “unit tests” of core model abilities.

From SALAMANDRA to SALAMANDRATA: BSC Submission for WMT25 General Machine Translation Shared Task
Javier Garcia Gilabert | Xixian Liao | Severino Da Dalt | Ella Bohman | Audrey Mash | Francesca De Luca Fornaciari | Irene Baucells | Joan Llop | Miguel Claramunt | Carlos Escolano | Maite Melero
Proceedings of the Tenth Conference on Machine Translation

In this paper, we present the SalamandraTA family of models, an improved iteration of Salamandra LLMs (Gonzalez-Agirre et al., 2025) specifically trained to achieve strong performance in translation-related tasks for 38 European languages. SalamandraTA comes in two scales: 2B and 7B parameters. For both versions, we applied the same training recipe with a first step of continual pre-training on parallel data, and a second step of supervised fine-tuning on high-quality instructions. The BSC submission to the WMT25 General Machine Translation shared task is based on the 7B variant of SalamandraTA. We first extended the model vocabulary to support the additional non-European languages included in the task. This was followed by a second phase of continual pretraining and supervised fine-tuning, carefully designed to optimize performance across all translation directions for this year’s shared task. For decoding, we employed two quality-aware strategies: Minimum Bayes Risk Decoding and Translation Reranking using Comet and Comet-kiwi. We publicly release both the 2B and 7B versions of SalamandraTA, along with the newer SalamandraTA-v2 model, on Hugging Face.

2024

Building a Data Infrastructure for a Mid-Resource Language: The Case of Catalan
Aitor Gonzalez-Agirre | Montserrat Marimon | Carlos Rodriguez-Penagos | Javier Aula-Blasco | Irene Baucells | Carme Armentano-Oller | Jorge Palomar-Giner | Baybars Kulebi | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Current LLM-based applications are becoming steadily available for everyone with reliable access to technology and the internet. These applications offer benefits to their users that leave those without access to them at a serious disadvantage. Given the vast amount of data needed to train LLMs, the gap between languages with access to such a quantity of data and those without it is currently larger than ever. To bridge this gap, the Aina Project was created to provide Catalan with the necessary resources to remain relevant in the context of AI/NLP applications based on LLMs. We thus present a set of strategies to consider when improving technology support for a mid- or low-resource language, especially addressing the sustainability of high-quality data acquisition and the challenges involved in the process. We also introduce a large amount of new annotated data for Catalan. Our hope is that those interested in replicating this work for another language can learn from what worked for us, the challenges that we faced, and the sometimes disheartening truth of working with mid- and low-resource languages.

FLOR: On the Effectiveness of Language Adaptation
Severino Da Dalt | Joan Llop | Irene Baucells | Marc Pamies | Yishi Xu | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models have amply proven their great capabilities, both in downstream tasks and real-life settings. However, low- and mid-resource languages do not have access to the necessary means to train such models from scratch, and often have to rely on multilingual models despite being underrepresented in the training data. For the particular case of the Catalan language, we prove that continued pre-training with vocabulary adaptation is a better alternative to make the most of already pre-trained models, even if these have not seen any Catalan data during their pre-training phase. We curate a 26B-token corpus and use it to further pre-train BLOOM, giving rise to the FLOR models. We perform an extensive evaluation to assess the effectiveness of our method, obtaining consistent gains across Catalan and Spanish tasks. The models, training data, and evaluation framework are made freely available under permissive licenses.

2023

Dynamic Stance: Modeling Discussions by Labeling the Interactions
Blanca Figueras | Irene Baucells | Tommaso Caselli
Findings of the Association for Computational Linguistics: EMNLP 2023

Stance detection is an increasingly popular task that has been mainly modeled as a static task, by assigning the expressed attitude of a text toward a given topic. Such a framing presents limitations, with trained systems showing poor generalization capabilities and being strongly topic-dependent. In this work, we propose modeling stance as a dynamic task, by focusing on the interactions between a message and its replies. For this purpose, we present a new annotation scheme that enables the categorization of all kinds of textual interactions. As a result, we have created a new corpus, the Dynamic Stance Corpus (DySC), consisting of three datasets in two middle-resourced languages: Catalan and Dutch. Our data analysis further supports our modeling decisions, empirically showing differences between the annotation of stance in static and dynamic contexts. We fine-tuned a series of monolingual and multilingual models on DySC, showing portability across topics and languages.