SQUiD: Synthesizing Relational Databases from Unstructured Text

Mushtari Sadia, Zhenning Yang, Yunming Xiao, Ang Chen, Amrita Roy Chowdhury


Abstract
Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets. Our code and datasets are publicly available at: https://github.com/Mushtari-Sadia/SQUiD.
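To make the task concrete, the sketch below shows the kind of artifact the abstract describes: a relational schema plus populated tables derived from a sentence of raw text. It is only an illustration of the target output, not SQUiD's pipeline. The input sentence, the table and column names, and the `materialize` helper are all hypothetical, and the schema and rows are written by hand where an LLM-based system would generate them.

```python
import sqlite3

# Hypothetical input text (an assumption for illustration, not from the paper's datasets).
text = "Alice Kim, 34, works in the Research department in Boston."

# Hand-written stand-in for the two outputs a synthesis system would produce:
# a schema (DDL) and the rows extracted from the text.
schema = """
CREATE TABLE department (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT
);
CREATE TABLE employee (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    age           INTEGER,
    department_id INTEGER REFERENCES department(id)
);
"""
rows = {
    "department": [(1, "Research", "Boston")],
    "employee": [(1, "Alice Kim", 34, 1)],
}


def materialize(schema, rows):
    """Create an in-memory SQLite database from a schema and per-table rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    for table, tuples in rows.items():
        # One "?" placeholder per column in the row.
        placeholders = ", ".join("?" for _ in tuples[0])
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", tuples)
    conn.commit()
    return conn


conn = materialize(schema, rows)

# Once the text is in relational form, it can be queried with ordinary SQL joins.
result = conn.execute(
    "SELECT e.name, d.name, d.city "
    "FROM employee e JOIN department d ON e.department_id = d.id"
).fetchall()
print(result)
```

The join at the end shows why the relational form is useful: facts that were implicit in one sentence become queryable across tables linked by a foreign key.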
Anthology ID:
2025.emnlp-main.1629
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
31975–32000
URL:
https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.1629/
DOI:
10.18653/v1/2025.emnlp-main.1629
Cite (ACL):
Mushtari Sadia, Zhenning Yang, Yunming Xiao, Ang Chen, and Amrita Roy Chowdhury. 2025. SQUiD: Synthesizing Relational Databases from Unstructured Text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 31975–32000, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
SQUiD: Synthesizing Relational Databases from Unstructured Text (Sadia et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.emnlp-main.1629.pdf
Checklist:
2025.emnlp-main.1629.checklist.pdf