MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Francisco Valentini, Viviana Cotik, Damián Furman, Ivan Bercovich, Edgar Altszyler, Juan Manuel Pérez


Abstract
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
Anthology ID:
2025.emnlp-main.1412
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
27740–27757
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1412/
Cite (ACL):
Francisco Valentini, Viviana Cotik, Damián Furman, Ivan Bercovich, Edgar Altszyler, and Juan Manuel Pérez. 2025. MessIRve: A Large-Scale Spanish Information Retrieval Dataset. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27740–27757, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
MessIRve: A Large-Scale Spanish Information Retrieval Dataset (Valentini et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1412.pdf
Checklist:
2025.emnlp-main.1412.checklist.pdf