Nicole Contaxis
2025
Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature
Pietro Marini | Aécio Santos | Nicole Contaxis | Juliana Freire
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
Despite growing emphasis on data sharing and the proliferation of open datasets, researchers face significant challenges in discovering relevant datasets for reuse and systematically identifying dataset references within scientific literature. We present Data Gatherer, an automated system that leverages large language models to identify and extract dataset references from scientific publications. To evaluate our approach, we developed and curated two high-quality benchmark datasets specifically designed for dataset identification tasks. Our experimental evaluation demonstrates that Data Gatherer achieves high precision and recall in automated dataset reference extraction, reducing the time and effort required for dataset discovery while improving the systematic identification of data sources in scholarly literature.