Jinhang Su

2026

Traditional tabular data synthesis methods often overlook the cross-modal heterogeneity of real-world tables, where structured continuous and discrete attributes coexist with unstructured long-text columns. Existing synthesis approaches struggle to simultaneously achieve accurate statistical fidelity for non-textual attributes and consistent semantic constraints between textual and non-textual attributes. In this work, we establish the first benchmark for long-text tabular data synthesis and introduce a novel metric, termed Textual Column Correlation Fidelity (TCCF), to quantify cross-modal semantic alignment. We propose AFT-Tab, an adversarial fine-tuning framework that synergistically trains an LLM-based text generator and a deep-learning-based non-textual generator. Through a dual-feedback mechanism guided by an LLM discriminator, AFT-Tab ensures both precise statistical distributions and rigorous semantic constraints. Experimental results show that AFT-Tab significantly outperforms state-of-the-art baselines in statistical fidelity, TCCF, diversity, and downstream task utility.

2025

SPFT-SQL: Enhancing Large Language Model for Text-to-SQL Parsing by Self-Play Fine-Tuning
Yuhao Zhang | Shaoming Duan | Jinhang Su | Chuanyi Liu | Peiyi Han
Findings of the Association for Computational Linguistics: EMNLP 2025

Despite the significant advancements of self-play fine-tuning (SPIN), which can transform a weak large language model (LLM) into a strong one through competitive interactions between models of varying capabilities, it still faces challenges in the Text-to-SQL task. SPIN does not generate new information, and the large number of correct SQL queries produced by the opponent model during self-play reduces the main model’s ability to generate accurate SQL queries. To address this challenge, we propose a new self-play fine-tuning method tailored for the Text-to-SQL task, called SPFT-SQL. Prior to self-play, we introduce a verification-based iterative fine-tuning approach, which synthesizes high-quality fine-tuning data iteratively based on the database schema and validation feedback to enhance model performance, while building a model base with varying capabilities. During the self-play fine-tuning phase, we propose an error-driven loss method that incentivizes incorrect outputs from the opponent model, enabling the main model to distinguish between correct SQL and erroneous SQL generated by the opponent model, thereby improving its ability to generate correct SQL. Extensive experiments and in-depth analyses on six open-source LLMs and five widely used benchmarks demonstrate that our approach outperforms existing state-of-the-art (SOTA) methods.

Co-authors

Yuhao Zhang 1

Venues

ACL 1
Findings 1
