ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Wenqi Pei, Shizheng Hou, Boyan Li, Chen Han, Zhichao Shi, Yuyu Luo


Abstract
Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce **ROSE**, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: SQL Prover assesses the semantic correctness of a predicted SQL against the user’s intent independently, while Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set **ROSE-VEC**, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen’s Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.
Anthology ID:
2026.acl-long.257
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
5682–5709
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.257/
DOI:
Bibkey:
Cite (ACL):
Wenqi Pei, Shizheng Hou, Boyan Li, Chen Han, Zhichao Shi, and Yuyu Luo. 2026. ROSE: An Intent-Centered Evaluation Metric for NL2SQL. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5682–5709, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
ROSE: An Intent-Centered Evaluation Metric for NL2SQL (Pei et al., ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.257.pdf
Checklist:
 2026.acl-long.257.checklist.pdf