2025
VeritasQA: A Truthfulness Benchmark Aimed at Multilingual Transferability
Javier Aula-Blasco | Júlia Falcão | Susana Sotelo | Silvia Paniagua | Aitor Gonzalez-Agirre | Marta Villegas
Proceedings of the 31st International Conference on Computational Linguistics
As Large Language Models (LLMs) become available in a wider range of domains and applications, evaluating the truthfulness of multilingual LLMs is an issue of increasing relevance. TruthfulQA (Lin et al., 2022) is one of few benchmarks designed to evaluate how models imitate widespread falsehoods. However, it is strongly English-centric and starting to become outdated. We present VeritasQA, a context- and time-independent truthfulness benchmark built with multilingual transferability in mind, and available in Spanish, Catalan, Galician and English. VeritasQA comprises a set of 353 questions and answers inspired by common misconceptions and falsehoods that are not tied to any particular country or recent event. We release VeritasQA under an open license and present the evaluation results of 15 models of various architectures and sizes.
Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study
Pablo Rodríguez | Silvia Paniagua Suárez | Pablo Gamallo | Susana Sotelo Docio
Findings of the Association for Computational Linguistics: ACL 2025
Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of the Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework.
2019
Contextualized Translations of Phrasal Verbs with Distributional Compositional Semantics and Monolingual Corpora
Pablo Gamallo | Susana Sotelo | José Ramom Pichel | Mikel Artetxe
Computational Linguistics, Volume 45, Issue 3 - September 2019
This article describes a compositional distributional method to generate contextualized senses of words and identify their appropriate translations in the target language using monolingual corpora. Word translation is modeled in the same way as contextualization of word meaning, but in a bilingual vector space. The contextualization of meaning is carried out by means of distributional composition within a structured vector space with syntactic dependencies, and the bilingual space is created by means of transfer rules and a bilingual dictionary. A phrase in the source language, consisting of a head and a dependent, is translated into the target language by selecting both the nearest neighbor of the head given the dependent, and the nearest neighbor of the dependent given the head. This process is expanded to larger phrases by means of incremental composition. Experiments were performed on English and Spanish monolingual corpora in order to translate phrasal verbs in context. A new bilingual data set to evaluate strategies aimed at translating phrasal verbs in restricted syntactic domains has been created and released.
Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.
Marcos Garcia | Marcos García Salido | Susana Sotelo | Estela Mosqueira | Margarita Alonso-Ramos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
This paper presents a new multilingual corpus with semantic annotation of collocations in English, Portuguese, and Spanish. The whole resource contains 155k tokens and 1,526 collocations labeled in context. The annotated examples belong to three syntactic relations (adjective-noun, verb-object, and nominal compounds), and represent 58 lexical functions in the Meaning-Text Theory (e.g., Oper, Magn, Bon, etc.). Each collocation was annotated by three linguists and the final resource was revised by a team of experts. The resulting corpus can serve as a basis to evaluate different approaches for collocation identification, which in turn can be useful for different NLP tasks such as natural language understanding or natural language generation.
2000
An Architecture for Document Routing in Spanish: Two Language Components, Pre-processor and Parser
Guillermo Rojo | Maria Concepción Álvarez | Pilar Alvariño | Adelaida Gil | María Paula Santalla | Susana Sotelo
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)