CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
Hanchong Zhang, Jieyu Li, Lu Chen, Ruisheng Cao, Yunyan Zhang, Yu Huang, Yefeng Zheng, Kai Yu
Abstract
The cross-domain text-to-SQL task aims to build a system that can parse user questions into SQL on completely unseen databases, whereas the single-domain text-to-SQL task evaluates performance on the same databases seen during training. Both setups face unavoidable difficulties in real-world applications. To this end, we introduce the cross-schema text-to-SQL task, in which the evaluation databases differ from those in the training data but come from the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to support the corresponding studies. CSS originally consisted of 4,340 question/SQL pairs across 2 databases. To generalize models to different medical systems, we extend CSS with 19 new databases and 29,280 corresponding examples. CSS is also a large corpus for single-domain Chinese text-to-SQL studies. We present the data collection approach and a series of analyses of the data statistics. To show the potential and usefulness of CSS, we benchmark baselines and report their results. Our dataset is publicly available at https://huggingface.co/datasets/zhanghanchong/css.
- Anthology ID:
- 2023.findings-acl.435
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6970–6983
- URL:
- https://aclanthology.org/2023.findings-acl.435
- DOI:
- 10.18653/v1/2023.findings-acl.435
- Cite (ACL):
- Hanchong Zhang, Jieyu Li, Lu Chen, Ruisheng Cao, Yunyan Zhang, Yu Huang, Yefeng Zheng, and Kai Yu. 2023. CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6970–6983, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset (Zhang et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2023.findings-acl.435.pdf