DataSciBench: An LLM Agent Benchmark for Data Science
Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue
Abstract
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Models (LLMs) in data science. Unlike existing benchmarks, which are limited to single tasks, simple evaluation metrics, and readily available ground truth (GT), DataSciBench is built on curated, natural, and challenging prompts with complex evaluation criteria and uncertain GT. To bridge the gap, we develop a semi-automated GT generation pipeline, integrating LLM-based self-consistency and human verification to ensure accuracy, together with predefined task types and aggregate functions (metrics). Furthermore, we introduce an innovative Intention-Function-Code (IFC) framework, which assesses code execution outcomes through metrics and programmatic rules. Evaluating 26 models (8 API-based, 8 open-source general, 9 code generation, and 1 agentic model), our approach offers rigorous insights into LLM strengths and weaknesses. Experimental results show that API-based models outperform their open-source counterparts across all metrics, with DeepAnalyze-8B leading among open-source models. We release all code and data at https://github.com/THUDM/DataSciBench.
- Anthology ID:
- 2026.findings-acl.181
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3685–3728
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.181/
- Cite (ACL):
- Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, and Yisong Yue. 2026. DataSciBench: An LLM Agent Benchmark for Data Science. In Findings of the Association for Computational Linguistics: ACL 2026, pages 3685–3728, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- DataSciBench: An LLM Agent Benchmark for Data Science (Zhang et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.181.pdf