Evaluating NL2SQL via SQL2NL

Mohammadtaher Safarzadeh; Afshin Oroojlooy; Dan Roth

doi:10.18653/v1/2025.findings-emnlp.1031

Abstract

Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain, highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.
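
As a rough illustration of the metric behind the figures quoted above, the sketch below computes execution accuracy on an original and a paraphrased question set and reports the difference between the two. It is not the authors' evaluation harness: the toy `singer` table, the helper names (`execution_match`, `execution_accuracy`, `build_toy_db`), and the hand-written "predicted" queries are all hypothetical, and comparing result rows without regard to order is a simplifying assumption. The drop is computed as an absolute difference in accuracy, which seems to be how the abstract's numbers are reported.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Assumption: row order is ignored, so ORDER BY mistakes are not penalized.
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


def execution_accuracy(conn, pairs):
    """Percentage of (predicted_sql, gold_sql) pairs whose results match."""
    matches = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return 100.0 * matches / len(pairs)


def build_toy_db():
    """Hypothetical in-memory database standing in for a benchmark schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?)",
        [("Ann", 31), ("Bo", 45), ("Cy", 27)],
    )
    return conn


conn = build_toy_db()
gold = "SELECT name FROM singer WHERE age > 30"

# Hypothetical model outputs for the original questions: both are correct.
original_pairs = [
    (gold, gold),
    ("SELECT name FROM singer WHERE age >= 31", gold),
]
# Hypothetical model outputs for paraphrased questions: the second one is wrong.
paraphrased_pairs = [
    (gold, gold),
    ("SELECT name FROM singer WHERE age > 40", gold),
]

original_acc = execution_accuracy(conn, original_pairs)
paraphrased_acc = execution_accuracy(conn, paraphrased_pairs)
drop = original_acc - paraphrased_acc  # absolute difference in accuracy
print(f"original={original_acc:.1f} paraphrased={paraphrased_acc:.1f} drop={drop:.1f}")
```

In an actual study the predicted queries would come from an NL2SQL model run on the original and SQL2NL-paraphrased questions, with the gold SQL held fixed across both sets.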

Anthology ID: 2025.findings-emnlp.1031
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 18954–18968
URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1031/
DOI: 10.18653/v1/2025.findings-emnlp.1031
Cite (ACL): Mohammadtaher Safarzadeh, Afshin Oroojlooy, and Dan Roth. 2025. Evaluating NL2SQL via SQL2NL. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 18954–18968, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Evaluating NL2SQL via SQL2NL (Safarzadeh et al., Findings 2025)
PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.1031.pdf
Checklist: 2025.findings-emnlp.1031.checklist.pdf
