Damián Furman
2025
MessIRve: A Large-Scale Spanish Information Retrieval Dataset
Francisco Valentini | Viviana Cotik | Damián Furman | Ivan Bercovich | Edgar Altszyler | Juan Manuel Pérez
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
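The baseline evaluations mentioned in the abstract can be approximated with any off-the-shelf multilingual retriever. Below is a minimal sketch of a dense-retrieval baseline; the encoder name, the toy Spanish query–passage pairs, and the relevance mapping are illustrative assumptions, not the actual MessIRve data or the models evaluated in the paper.

```python
# Minimal sketch of a dense-retrieval baseline evaluation.
# The model name and the toy query/document pairs are illustrative
# assumptions, not the dataset or models used in the paper.
from sentence_transformers import SentenceTransformer, util

# A small multilingual bi-encoder; any Spanish-capable encoder would do.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy (query, relevant passage) pairs standing in for MessIRve examples.
queries = [
    "¿quién escribió el martín fierro?",
    "capital de uruguay",
]
docs = [
    "El Martín Fierro es un poema narrativo escrito por José Hernández en 1872.",
    "Montevideo es la capital y ciudad más poblada de Uruguay.",
]
relevant = {0: 0, 1: 1}  # query index -> index of its relevant document

# Encode queries and documents, then score every pair with cosine similarity.
q_emb = model.encode(queries, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(q_emb, d_emb)

# Mean reciprocal rank of the relevant document across queries.
mrr = 0.0
for qi, rel_di in relevant.items():
    ranking = scores[qi].argsort(descending=True).tolist()
    mrr += 1.0 / (ranking.index(rel_di) + 1)
print(f"MRR: {mrr / len(relevant):.3f}")
```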
2023
High-quality argumentative information in low resources approaches improve counter-narrative generation
Damián Furman | Pablo Torres | José Rodríguez | Diego Letzen | Maria Martinez | Laura Alemany
Findings of the Association for Computational Linguistics: EMNLP 2023
It has been shown that high-quality fine-tuning boosts the performance of language models, even when the fine-tuning dataset is small. In this work we show how highly targeted fine-tuning improves the task of hate speech counter-narrative generation in user-generated text, even with very small training sets (1,722 counter-narratives for English and 355 for Spanish). Providing a small subset of examples focusing on a single argumentative strategy, together with the argumentative analysis relevant to that strategy, yields counter-narratives that are as satisfactory as providing the whole set of counter-narratives. We also show that a good base model is required for the fine-tuning to have a positive impact. Indeed, for Spanish, the counter-narratives obtained without fine-tuning are mostly unacceptable, and, while fine-tuning improves their overall quality, the performance still remains quite unsatisfactory.
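The targeted fine-tuning described in the abstract can be sketched with standard tooling. The snippet below is a minimal illustration, not the paper's exact setup: the Spanish base model, the prompt template, and the example fields (message, strategy, counter-narrative) are assumptions chosen only to show how a small, strategy-annotated training set might be fed to a causal language model.

```python
# Minimal sketch of targeted fine-tuning for counter-narrative generation.
# The base model, prompt template, and example fields are illustrative
# assumptions; the paper's exact data format and models may differ.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "datificate/gpt2-small-spanish"  # hypothetical choice of Spanish base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each example pairs a hateful message with the argumentative strategy it
# targets and a gold counter-narrative (toy data, for illustration only).
examples = [
    {
        "hate": "Los inmigrantes nos quitan el trabajo.",
        "strategy": "refutar la premisa con datos",
        "counter": "Los estudios muestran que la inmigración crea más empleo del que desplaza.",
    },
]

def to_text(ex):
    # Single-strategy prompt: the argumentative analysis is given alongside the message.
    return {
        "text": (f"Mensaje: {ex['hate']}\n"
                 f"Estrategia: {ex['strategy']}\n"
                 f"Contranarrativa: {ex['counter']}{tokenizer.eos_token}")
    }

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = Dataset.from_list(examples).map(to_text).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cn-finetune", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```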