2025
IberoBench: A Benchmark for LLM Evaluation in Iberian Languages
Irene Baucells | Javier Aula-Blasco | Iria de-Dios-Flores | Silvia Paniagua Suárez | Naiara Perez | Anna Salles | Susana Sotelo Docio | Júlia Falcão | Jose Javier Saiz | Robiert Sepulveda Torres | Jeremy Barnes | Pablo Gamallo | Aitor Gonzalez-Agirre | German Rigau | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics
The current best practice for measuring the performance of base Large Language Models is to establish a multi-task benchmark that covers a range of capabilities of interest. Currently, however, such benchmarks are only available in a few high-resource languages. To address this situation, we present IberoBench, a multilingual, multi-task benchmark for Iberian languages (i.e., Basque, Catalan, Galician, European Spanish and European Portuguese) built on the LM Evaluation Harness framework. The benchmark consists of 62 tasks divided into 179 subtasks. We evaluate 33 existing LLMs on IberoBench in 0- and 5-shot settings. We also explore the issues we encountered when working with the Harness and our approaches to solving them to ensure high-quality evaluation.
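Because IberoBench is implemented on top of the LM Evaluation Harness, running it reduces to a standard harness call. A minimal sketch, assuming a harness installation that includes the IberoBench task definitions; the task name iberobench_ca and the model id are illustrative, not taken from the paper:

```python
# Sketch: evaluating a model on an IberoBench task via the LM Evaluation Harness.
# Assumes a harness install that ships the IberoBench task definitions;
# "iberobench_ca" and the model id are hypothetical placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # HuggingFace backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["iberobench_ca"],                       # hypothetical task name
    num_fewshot=5,                                 # the paper reports 0- and 5-shot
)
print(results["results"])
```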
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Oscar Sainz | Naiara Perez | Julen Etxaniz | Joseba Fernandez de Landa | Itziar Aldabe | Iker García-Ferrero | Aimar Zabala | Ekhi Azurmendi | German Rigau | Eneko Agirre | Mikel Artetxe | Aitor Soroa
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and human preferences from 1,680 participants. Our results show that target-language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as the backbone outperforms using a non-instructed base model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model approaches frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
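A minimal sketch of the kind of synthetic-instruction sampling such a setup relies on, drawing generations from the instructed backbone; the prompt and model id are illustrative assumptions, not the paper's exact pipeline:

```python
# Sketch: sampling synthetic instruction data from an instructed backbone model.
# Prompt wording and model id are illustrative assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # instructed backbone (assumed id)
)

seed_prompt = (
    "Write one new instruction a user might give an assistant, "
    "followed by a helpful response.\nInstruction:"
)
samples = generator(seed_prompt, max_new_tokens=200, do_sample=True,
                    temperature=0.9, num_return_sequences=4)
for s in samples:
    print(s["generated_text"])
```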
Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages?
Luca Moroni | Javier Aula-Blasco | Simone Conia | Irene Baucells | Naiara Perez | Silvia Paniagua Suárez | Anna Sallés | Malte Ostendorff | Júlia Falcão | Guijin Son | Aitor Gonzalez-Agirre | Roberto Navigli | Marta Villegas
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As large language models (LLMs) continue to improve, their evaluation increasingly centers on complex, high-level tasks, often at the expense of systematically assessing fundamental capabilities. To address this gap, recent work proposed LMentry, a compact benchmark comprising tasks that are trivial for humans but remain surprisingly difficult for LLMs. However, LMentry is limited to English, leaving its insights linguistically narrow. In this paper, we present Multi-LMentry, a ground-up recreation of LMentry that enables systematic evaluation of LLMs on basic reasoning and understanding tasks across nine diverse languages. Multi-LMentry includes English and expands to Basque, Brazilian Portuguese, Catalan, Galician, German, Italian, Korean, and Spanish, emphasizing the importance of cross-lingual and low-resource settings. To validate that Multi-LMentry is still trivial for humans, we demonstrate that L2 speakers with only elementary proficiency achieve near-perfect scores in a low-resource language, namely, Basque. Through extensive experiments, we reveal that state-of-the-art open-weight multilingual LLMs still fall short of human performance on elementary tasks in many languages. Our results expose new failure modes that remain hidden in monolingual evaluation, underscoring the need for rigorous, language-diverse “unit tests” of core model abilities.
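LMentry-style tasks are attractive precisely because they can be scored with simple rules. A toy example of such a programmatic "unit test"; the task wording is illustrative, not taken from Multi-LMentry:

```python
# Sketch of an LMentry-style check: trivial for humans, automatically
# verifiable with a rule. Task and answer strings are illustrative.
def score_longer_word(model_answer: str, word_a: str, word_b: str) -> bool:
    """Return True iff the model names the longer of the two words."""
    gold = word_a if len(word_a) > len(word_b) else word_b
    return gold.lower() in model_answer.lower()

# Example check against a hypothetical model answer (Basque words):
assert score_longer_word("The longer word is 'etxea'.", "etxea", "ur")
```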
2024
Latxa: An Open Language Model and Evaluation Suite for Basque
Julen Etxaniz | Oscar Sainz | Naiara Perez | Itziar Aldabe | German Rigau | Eneko Agirre | Aitor Ormazabal | Mikel Artetxe | Aitor Soroa
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,046 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa models and our new pretraining corpora and evaluation datasets are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.
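A minimal sketch of loading a Latxa checkpoint with HuggingFace Transformers; the repository id HiTZ/latxa-7b-v1 is assumed from the public release and should be checked against the authors' hub page:

```python
# Sketch: loading a Latxa checkpoint. The repo id is an assumption based on
# the public release under the HiTZ organisation.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HiTZ/latxa-7b-v1"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Euskal Herriko hiriburua", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```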
2020
Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT
Aitor García Pablos | Naiara Perez | Montse Cuadros
Proceedings of the Twelfth Language Resources and Evaluation Conference
Massive digital data processing provides a wide range of opportunities and benefits, but at the cost of endangering personal data privacy. Anonymisation consists in removing or replacing sensitive information from data, enabling its exploitation for different purposes while preserving the privacy of individuals. Over the years, many automatic anonymisation systems have been proposed; however, depending on the type of data, the target language, or the availability of training documents, the task still remains challenging. The emergence of novel deep-learning models during the last two years has brought large improvements to the state of the art in the field of Natural Language Processing. These advancements have been most noticeably led by BERT, a model proposed by Google in 2018, and by shared language models pre-trained on millions of documents. In this paper, we use a BERT-based sequence labelling model to conduct a series of anonymisation experiments on several clinical datasets in Spanish. We also compare BERT with other algorithms. The experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain-specific feature engineering.
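A minimal sketch of the kind of setup the paper describes: a pretrained Spanish BERT with a token-classification head over BIO labels for sensitive categories, to be fine-tuned on the annotated clinical data. The model id and label set are illustrative assumptions:

```python
# Sketch: BERT-based sequence labelling for sensitive-data detection.
# Label set is an assumed subset; the classification head starts randomly
# initialised and must be fine-tuned on the annotated clinical corpora.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE"]  # assumed subset
name = "dccuchile/bert-base-spanish-wwm-cased"          # a general-domain Spanish BERT
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(
    name, num_labels=len(labels)
)
```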
NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts
Salvador Lima Lopez | Naiara Perez | Montse Cuadros | German Rigau
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper introduces the first version of the NUBes corpus (Negation and Uncertainty annotations in Biomedical texts in Spanish). The corpus is part of ongoing research and currently consists of 29,682 sentences obtained from anonymised health records, annotated with negation and uncertainty. The article includes an exhaustive comparison with similar corpora in Spanish, and presents the main annotation and design decisions. Additionally, we perform preliminary experiments using deep learning algorithms to validate the annotated dataset. As far as we know, NUBes is the largest available corpus for negation in Spanish and the first that also incorporates the annotation of speculation cues, scopes, and events.
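For illustration, cue/scope annotations of this kind are often rendered as BIO sequences for sequence-labelling experiments; the exact NUBes format may differ from this sketch:

```python
# Sketch: negation cue and scope as BIO sequences over a Spanish sentence
# ("No adenopathies are observed"). Illustrative only, not the NUBes format.
tokens    = ["No",    "se",      "observan", "adenopatías"]
neg_cue   = ["B-Cue", "O",       "O",        "O"]
neg_scope = ["O",     "B-Scope", "I-Scope",  "I-Scope"]
```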
HitzalMed: Anonymisation of Clinical Text in Spanish
Salvador Lima Lopez | Naiara Perez | Laura García-Sardiña | Montse Cuadros
Proceedings of the Twelfth Language Resources and Evaluation Conference
HitzalMed is a web-based tool that performs automatic detection of sensitive information in clinical texts using machine learning algorithms reported to be competitive for the task. Once sensitive information is detected, different user-configurable anonymisation techniques are applied: for instance, substitution, where sensitive items are replaced by text of the same category in an effort to generate a new document that looks as natural as the original one. The tool accepts data in different document formats and outputs downloadable anonymised data. This paper presents the anonymisation and substitution technology and the demonstrator, which is publicly available at https://snlt.vicomtech.org/hitzalmed.
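A minimal sketch of the category-preserving substitution strategy described above: each detected sensitive span is replaced by a surrogate of the same category so the output still reads naturally. The detector output format and surrogate lists are illustrative assumptions:

```python
# Sketch: substitution-based anonymisation. Spans come from a detector as
# (start, end, category) triples; surrogate lists are illustrative.
import random

SURROGATES = {"NAME": ["María López", "Jon Agirre"], "DATE": ["12/03/2001"]}

def substitute(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each non-overlapping span with a same-category surrogate."""
    for start, end, cat in sorted(spans, reverse=True):  # right-to-left keeps offsets valid
        text = text[:start] + random.choice(SURROGATES[cat]) + text[end:]
    return text

print(substitute("Paciente Juan Pérez ingresó el 01/02/2020.",
                 [(9, 19, "NAME"), (31, 41, "DATE")]))
```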
2018
Biomedical term normalization of EHRs with UMLS
Naiara Perez-Miguel | Montse Cuadros | German Rigau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Hate Speech Dataset from a White Supremacy Forum
Ona de Gibert | Naiara Perez | Aitor García-Pablos | Montse Cuadros
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Hate speech is commonly defined as any communication that disparages a target group of people based on some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic. Due to the massive rise of user-generated web content on social media, the amount of hate speech is also steadily increasing. Over the past years, interest in online hate speech detection and, particularly, the automation of this task has continuously grown, along with the societal impact of the phenomenon. This paper describes a hate speech dataset composed of thousands of sentences manually labelled as containing hate speech or not. The sentences have been extracted from Stormfront, a white supremacist forum. A custom annotation tool has been developed to carry out the manual labelling task which, among other things, allows the annotators to choose whether to read the context of a sentence before labelling it. The paper also provides a thoughtful qualitative and quantitative study of the resulting dataset and several baseline experiments with different classification models. The dataset is publicly available.
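A minimal sketch of a sentence-level baseline of the kind the paper reports; the actual models, features, and dataset fields may differ:

```python
# Sketch: a simple bag-of-words baseline for sentence-level hate speech
# classification. Data shown is placeholder, not from the released dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["example sentence one", "example sentence two"]  # placeholder data
labels = [0, 1]  # 1 = contains hate speech

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["another sentence"]))
```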
2017
Multilingual CALL Framework for Automatic Language Exercise Generation from Free Text
Naiara Perez | Montse Cuadros
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics
This paper describes a web-based application for designing and answering language-learning exercises. It is available in Basque, Spanish, English, and French. Based on open-source Natural Language Processing (NLP) technology such as word embedding models and word sense disambiguation, the application enables users to easily create, automatically and in real time, three types of exercises, namely Fill-in-the-Gaps, Multiple Choice, and Shuffled Sentences questionnaires. These are generated from texts of the users’ own choice, so they can train their language skills with content of their particular interest.
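A minimal sketch of Fill-in-the-Gaps generation with embedding-based distractors, in the spirit of the described application; the embeddings path and the helper function are illustrative assumptions:

```python
# Sketch: blank out a target word and draw distractors from embedding
# neighbours, as the application's description suggests. The vectors file
# path is an assumption; any word2vec-format KeyedVectors file would do.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("embeddings.vec")  # assumed path

def make_gap_item(sentence: str, target: str, n_distractors: int = 3):
    """Blank out `target` and propose embedding neighbours as distractors."""
    stem = sentence.replace(target, "_____")
    distractors = [w for w, _ in vectors.most_similar(target, topn=n_distractors)]
    return stem, [target] + distractors

print(make_gap_item("The cat sat on the mat.", "cat"))
```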
2016
Exploiting a Large Strongly Comparable Corpus
Thierry Etchegoyhen | Andoni Azpeitia | Naiara Pérez
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This article describes a large comparable corpus for Basque and Spanish and the methods employed to build a parallel resource from the original data. The EITB corpus, a strongly comparable corpus in the news domain, is to be shared with the research community as an aid for the development and testing of methods in comparable corpora exploitation, and as a basis for improving data-driven machine translation systems for this language pair. Competing approaches were explored for aligning comparable segments in the corpus, resulting in the design of a simple method which outperformed a state-of-the-art method on the corpus test sets. The method we present is highly portable, computationally efficient, and significantly reduces deployment work, a welcome result for the exploitation of comparable corpora.
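For intuition, a toy lexical-overlap aligner is sketched below; the paper's actual method differs (and plain Basque-Spanish token overlap is weak), so this is purely illustrative:

```python
# Toy comparable-segment aligner: score candidate pairs by Jaccard overlap
# of tokens and keep the best match per source segment above a threshold.
# Illustrative only, not the method presented in the paper.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align(src_segments, tgt_segments, threshold=0.3):
    pairs = []
    for s in src_segments:
        best = max(tgt_segments, key=lambda t: jaccard(s, t))
        if jaccard(s, best) >= threshold:
            pairs.append((s, best))
    return pairs
```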