Peiyi Han
2025
DSQG-Syn: Synthesizing High-quality Data for Text-to-SQL Parsing by Domain Specific Question Generation
Shaoming Duan | Youxuan Wu | Chuanyi Liu | Yuhao Zhang | Zirui Wang | Peiyi Han | Shengyuan Yu | Liang Yan | Yingwei Liang
Findings of the Association for Computational Linguistics: NAACL 2025
Synthetic data has recently proven effective in enhancing the accuracy of Text-to-SQL parsers. However, existing methods first generate SQL queries by randomly sampling tables and columns based on probability, and then synthesize natural language questions (NLQs). This approach often produces a large number of NLQ-SQL pairs that are irrelevant to the target domain and inconsistent in query intent, significantly diminishing the fine-tuning effectiveness of LLMs. In this paper, we introduce DSQG-Syn, a novel Text-to-SQL data synthesis framework based on domain-specific question generation. Specifically, we design a question generation method that creates domain-relevant questions based on predefined question types, ensuring coverage of major SQL operations. Guided by these questions, we synthesize NLQ-SQL pairs that are both domain-relevant and intent-consistent. To further enhance data quality, we filter out noisy samples from the generated pairs. When popular open-source LLMs are fine-tuned on our high-quality synthesized dataset, they achieve significant accuracy improvements, surpassing the performance of closed-source LLM-based approaches. Moreover, we demonstrate that our method outperforms existing state-of-the-art (SOTA) data synthesis techniques.
2024
Enhancing Text-to-SQL Parsing through Question Rewriting and Execution-Guided Refinement
Wenxin Mao | Ruiqi Wang | Jiyu Guo | Jichuan Zeng | Cuiyun Gao | Peiyi Han | Chuanyi Liu
Findings of the Association for Computational Linguistics: ACL 2024
Large Language Model (LLM)-based approaches have become the mainstream for the Text-to-SQL task and achieve remarkable performance. In this paper, we augment existing prompt engineering methods by exploiting database content and execution feedback. Specifically, we introduce DART-SQL, which comprises two key components: (1) Question Rewriting: DART-SQL rewrites natural language questions by leveraging database content information to eliminate ambiguity. (2) Execution-Guided Refinement: DART-SQL incorporates database content information and utilizes the execution results of the generated SQL to iteratively refine the SQL. We apply this framework to two LLM-based approaches (DAIL-SQL and C3) and test it on four widely used benchmarks (Spider-dev, Spider-test, Realistic and DK). Experiments show that applying our framework to DAIL-SQL and C3 yields average improvements of 12.41% and 5.38%, respectively, in execution accuracy (EX).
Co-authors
- Chuanyi Liu 2
- Shaoming Duan 1
- Cuiyun Gao 1
- Jiyu Guo 1
- Yingwei Liang 1