PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes

Xuming Hu, Chuan Lei, Xiao Qin, Asterios Katsifodimos, Christos Faloutsos, Huzefa Rangwala


Abstract
Given a query table, how can we effectively discover multi-key joinable tables on the web? This can be seen as a retrieval task, in which users search the web for tables related to an existing one. Searching for and discovering such joinable tables is critical to data analysts and data scientists for reporting, establishing correlations, and training machine learning models. Existing joinable table search methods have mostly focused on single-key (unary) joins, where a single column is the join key. However, these methods are ineffective when dealing with join keys composed of multiple columns (n-ary joins), which are prevalent in web table corpora. In this paper, we introduce PolyJoin, which finds multi-key semantically joinable tables on the web, given a query table. PolyJoin employs a multi-key encoder and a novel self-supervised training method to generate the representations of multiple join keys, preserving the alignment across multiple columns. In particular, PolyJoin is equipped with a hierarchical contrastive learning technique to further enhance the model’s semantic understanding of multi-key joinable tables. PolyJoin outperforms the state-of-the-art methods by 2.89% and 3.67% with respect to MAP@30 and R@30 on two real-world web table benchmarks, respectively.
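To illustrate why single-key methods can mislead on n-ary joins, here is a minimal sketch that scores joinability by exact-match containment of key tuples. It is not PolyJoin's method, which uses learned semantic representations. The function name `tuple_containment` and the toy tables are assumptions made only for this illustration.

```python
def tuple_containment(query_rows, cand_rows, key_cols):
    """Fraction of distinct query key tuples that also occur in the candidate.

    Exact-match stand-in for joinability (hypothetical helper, not from the paper).
    """
    query_keys = {tuple(row[col] for col in key_cols) for row in query_rows}
    cand_keys = {tuple(row[col] for col in key_cols) for row in cand_rows}
    if not query_keys:
        return 0.0
    return len(query_keys & cand_keys) / len(query_keys)


# Toy query table with a composite key (city, country).
query = [
    {"city": "Paris", "country": "France"},
    {"city": "Paris", "country": "USA"},
    {"city": "London", "country": "UK"},
]

# Toy candidate table: every city and every country from the query appears,
# but mostly in different row pairings.
candidate = [
    {"city": "Paris", "country": "France"},
    {"city": "London", "country": "Canada"},
    {"city": "Sydney", "country": "UK"},
    {"city": "Austin", "country": "USA"},
]

# Unary scores: each key column is checked on its own.
city_score = tuple_containment(query, candidate, ["city"])
country_score = tuple_containment(query, candidate, ["country"])

# N-ary score: the columns must match together, row by row.
pair_score = tuple_containment(query, candidate, ["city", "country"])

print(city_score, country_score, pair_score)
```

In this toy case each column looks fully joinable in isolation, yet only one of the three composite keys actually matches. That gap is the cross-column alignment that a multi-key representation needs to preserve.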
Anthology ID:
2025.findings-naacl.23
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
384–395
URL:
https://preview.aclanthology.org/Author-Pages-WenzhengZhang-ZhengyanShi-ShuYang/2025.findings-naacl.23/
DOI:
10.18653/v1/2025.findings-naacl.23
Cite (ACL):
Xuming Hu, Chuan Lei, Xiao Qin, Asterios Katsifodimos, Christos Faloutsos, and Huzefa Rangwala. 2025. PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 384–395, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
PolyJoin: Semantic Multi-key Joinable Table Search in Data Lakes (Hu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/Author-Pages-WenzhengZhang-ZhengyanShi-ShuYang/2025.findings-naacl.23.pdf