Valle Ruiz-Fernández
2026
EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering
Valle Ruiz-Fernández | Mario Mina | Júlia Falcão | Luis Antonio Vasquez Reina | Anna Salles | Aitor Gonzalez-Agirre | Olatz Perez-de-Viñaspre
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.
2025
Cognitive Biases, Task Complexity, and Result Interpretability in Large Language Models
Mario Mina | Valle Ruiz-Fernández | Júlia Falcão | Luis Vasquez-Reina | Aitor Gonzalez-Agirre
Proceedings of the 31st International Conference on Computational Linguistics
In humans, cognitive biases are systematic deviations from rationality in judgment that simplify complex decisions. They typically manifest as a consequence of learned behaviors or limitations on information processing capabilities. Recent work has shown that these biases can percolate through training data and ultimately be learned by language models. We examine different groups of models, factoring in model size and type (base or instructed), for four kinds of cognitive bias: primacy, recency, common token, and majority class bias. We evaluate the performance of each model for each type of bias in different settings using simple and complex variants of datasets. Our results show that some biases have much stronger effects than others, and that task complexity plays a part in eliciting stronger effects for some of these biases, as measured by effect size. We show that some cognitive biases, such as common token and majority class bias, are not straightforward to evaluate, and that some effects previously classified in the literature as common token bias are actually due to primacy and recency bias.
2024
BSC-LANGTECH at FIGNEWS 2024 Shared Task: Exploring Semi-Automatic Bias Annotation using Frame Analysis
Valle Ruiz-Fernández | José Saiz | Aitor Gonzalez-Agirre
Proceedings of the Second Arabic Natural Language Processing Conference
This paper introduces the methodology of the BSC-LANGTECH team for the FIGNEWS 2024 Shared Task on News Media Narratives. For the bias annotation subtask, we apply the theory and methods of framing analysis to develop guidelines for annotating bias in the corpus provided by the task organizers. The manually annotated subset, on which moderate inter-annotator agreement (IAA) was achieved, is then used with Deep Learning techniques to explore automatic annotation and test the reliability of our framework.
EsCoLA: Spanish Corpus of Linguistic Acceptability
Núria Bel | Marta Punsola | Valle Ruiz-Fernández
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Acceptability is one of the General Language Understanding Evaluation Benchmark (GLUE) probing tasks proposed to assess the linguistic capabilities acquired by a deep-learning transformer-based language model (LM). In this paper, we introduce the Spanish Corpus of Linguistic Acceptability (EsCoLA). EsCoLA has been developed following the example of other linguistic acceptability data sets for English, Italian, Norwegian and Russian, with the aim of having a complete GLUE benchmark for Spanish. EsCoLA consists of 11,174 sentences and their acceptability judgements as found in well-known Spanish reference grammars. Additionally, following previous practice, each sentence has been annotated with the class of linguistic phenomenon it exemplifies. As task baselines, we also provide the results of fine-tuning four different language models on this data set and the results of a human annotation experiment. The results are analyzed and discussed to guide future research. EsCoLA is released under a CC-BY 4.0 license and is freely available at https://doi.org/10.34810/data1138.