GraDaSE: Graph-Based Dataset Search with Examples

Jing He, Mingyang Lv, Qing Shi, Gong Cheng


Abstract
Dataset search is a specialized information retrieval task. In the emerging scenario of Dataset Search with Examples (DSE), the user submits a query together with a few target datasets that are known to be relevant and serve as examples. The retrieved datasets are expected to be both relevant to the query and similar to the target datasets. In contrast to existing text-based retrievers, we propose GraDaSE, a graph-based approach. Beyond the textual metadata of the datasets, we identify provenance-based and topic-based relationships among them to construct a graph, and jointly encode structural and textual information to rank candidate datasets. GraDaSE outperforms a variety of strong baselines on two test collections, including DataFinder-E, which we construct.
Anthology ID:
2025.emnlp-main.353
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
6932–6943
URL:
https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.emnlp-main.353/
Cite (ACL):
Jing He, Mingyang Lv, Qing Shi, and Gong Cheng. 2025. GraDaSE: Graph-Based Dataset Search with Examples. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6932–6943, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
GraDaSE: Graph-Based Dataset Search with Examples (He et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/lei-li-partial-disambiguation/2025.emnlp-main.353.pdf
Checklist:
2025.emnlp-main.353.checklist.pdf