Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature

Pietro Marini, Aécio Santos, Nicole Contaxis, Juliana Freire


Abstract
Despite growing emphasis on data sharing and the proliferation of open datasets, researchers face significant challenges in discovering relevant datasets for reuse and systematically identifying dataset references within scientific literature. We present Data Gatherer, an automated system that leverages large language models to identify and extract dataset references from scientific publications. To evaluate our approach, we developed and curated two high-quality benchmark datasets specifically designed for dataset identification tasks. Our experimental evaluation demonstrates that Data Gatherer achieves high precision and recall in automated dataset reference extraction, reducing the time and effort required for dataset discovery while improving the systematic identification of data sources in scholarly literature.
Anthology ID:
2025.sdp-1.10
Volume:
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Tirthankar Ghosal, Philipp Mayr, Amanpreet Singh, Aakanksha Naik, Georg Rehm, Dayne Freitag, Dan Li, Sonja Schimmler, Anita De Waard
Venues:
sdp | WS
Publisher:
Association for Computational Linguistics
Pages:
114–123
URL:
https://preview.aclanthology.org/display_plenaries/2025.sdp-1.10/
Cite (ACL):
Pietro Marini, Aécio Santos, Nicole Contaxis, and Juliana Freire. 2025. Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 114–123, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature (Marini et al., SDP 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.sdp-1.10.pdf