Liang Yan


2025

DSQG-Syn: Synthesizing High-quality Data for Text-to-SQL Parsing by Domain Specific Question Generation
Shaoming Duan | Youxuan Wu | Chuanyi Liu | Yuhao Zhang | Zirui Wang | Peiyi Han | Shengyuan Yu | Liang Yan | Yingwei Liang
Findings of the Association for Computational Linguistics: NAACL 2025

Synthetic data has recently proven effective in enhancing the accuracy of Text-to-SQL parsers. However, existing methods first generate SQL queries by randomly sampling tables and columns based on probability and then synthesize natural language questions (NLQs). This approach often produces a large number of NLQ-SQL pairs that are irrelevant to the target domain and inconsistent in query intent, significantly diminishing the fine-tuning effectiveness of LLMs. In this paper, we introduce DSQG-Syn, a novel text-to-SQL data synthesis framework based on domain-specific question generation. Specifically, we design a question generation method that creates domain-relevant questions based on predefined question types, ensuring coverage of major SQL operations. Guided by these questions, we synthesize NLQ-SQL pairs that are both domain-relevant and intent-consistent. To further enhance data quality, we filter out noisy samples from the generated pairs. When popular open-source LLMs are fine-tuned on our high-quality synthesized dataset, they achieve significant accuracy improvements, surpassing the performance of closed-source LLM-based approaches. Moreover, we demonstrate that our method outperforms existing state-of-the-art (SOTA) data synthesis techniques.