Yuying Li
2026
RealChart2Code: Bridging the Gap in Real-World Chart-to-Code Generation via Multi-Task Evaluation
Jiajun Zhang | Yuying Li | Zhixun Li | Xingyu Guo | Jingzhuo Wu | Leqi Zheng | Yiran Yang | Jianke Zhang | Qingbin Li | Shannan Yan | Changguo Jia | Junfei Wu | Zilei Wang | Qiang Liu | Liang Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Linzhuang Sun | Tianyu Guo | Hao Liang | Ruitong Liu | Yuying Li | Qifeng Cai | Jingxuan Wei | Yuchen Wu | Bihui Yu | Xiangxiang Zhang | Wentao Zhang | Bin Cui
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in Large Language Models (LLMs) have revolutionized Text-to-SQL parsing, achieving remarkable success in static, single-turn query generation. However, a significant gap remains between these academic benchmarks and real-world utility. In practical applications such as financial auditing or business analytics, user intents are rarely static; they evolve through iterative refinement and require not just information retrieval (SELECT) but continuous state manipulation (INSERT, UPDATE, DELETE). To bridge this gap, we introduce DySQL-Bench, a benchmark designed to rigorously evaluate LLMs within a dynamic interaction framework. Rather than relying on manual curation, DySQL-Bench employs a two-stage automated synthesis pipeline: it first transforms raw relational schemas into hierarchical logic trees to generate user-database interactions, then applies a verify-and-refine protocol in which human expert validation ensures 100% correctness. We further propose an interactive evaluation environment that simulates a triadic workflow involving an LLM-simulated user, the agent under test, and an executable database system. DySQL-Bench spans 13 diverse domains with 1,072 complex tasks, and our experiments reveal that current powerful models struggle in this realistic setting. Notably, GPT-4o achieves only 58.34% overall accuracy and just 23.81% on the strict Pass^5 metric, highlighting the substantial challenges DySQL-Bench poses for the future of database agents.
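The abstract does not define Pass^5. The following is a minimal sketch of one common reading, assuming Pass^k means the probability that all k independent attempts at the same task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function names and this estimator are illustrative assumptions, not the authors' evaluation code.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of ONE task all succeed.

    Assumed estimator (not taken from the paper): C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_successes: list[int], trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in a benchmark."""
    if not per_task_successes:
        raise ValueError("need at least one task")
    scores = [pass_hat_k(c, trials, k) for c in per_task_successes]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical data: four tasks, five trials each, number of successful trials per task.
    successes = [5, 4, 5, 0]
    print(benchmark_pass_hat_k(successes, trials=5, k=1))  # plain average success rate
    print(benchmark_pass_hat_k(successes, trials=5, k=5))  # strict: all five must pass
```

Under this reading, a task that fails even once in five trials contributes nothing at k = 5, which would explain why a strict Pass^5 score can sit far below overall accuracy.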