Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature
Pietro Marini, Aécio Santos, Nicole Contaxis, Juliana Freire
Abstract
Despite growing emphasis on data sharing and the proliferation of open datasets, researchers face significant challenges in discovering relevant datasets for reuse and systematically identifying dataset references within scientific literature. We present Data Gatherer, an automated system that leverages large language models to identify and extract dataset references from scientific publications. To evaluate our approach, we developed and curated two high-quality benchmark datasets specifically designed for dataset identification tasks. Our experimental evaluation demonstrates that Data Gatherer achieves high precision and recall in automated dataset reference extraction, reducing the time and effort required for dataset discovery while improving the systematic identification of data sources in scholarly literature.
- Anthology ID: 2025.sdp-1.10
- Volume: Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Tirthankar Ghosal, Philipp Mayr, Amanpreet Singh, Aakanksha Naik, Georg Rehm, Dayne Freitag, Dan Li, Sonja Schimmler, Anita De Waard
- Venues: sdp | WS
- Publisher: Association for Computational Linguistics
- Pages: 114–123
- URL: https://preview.aclanthology.org/display_plenaries/2025.sdp-1.10/
- Cite (ACL): Pietro Marini, Aécio Santos, Nicole Contaxis, and Juliana Freire. 2025. Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature. In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pages 114–123, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature (Marini et al., sdp 2025)
- PDF: https://preview.aclanthology.org/display_plenaries/2025.sdp-1.10.pdf