Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation
Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, Wanxiang Che
Abstract
Chart2code has recently received significant attention in the multimodal community due to its potential to reduce the burden of visualization and promote a more detailed understanding of charts. However, existing Chart2code-related training datasets suffer from at least one of the following issues: (1) limited scale, (2) limited type coverage, and (3) inadequate complexity. To address these challenges, we seek more diverse sources that better align with real-world user distributions and propose dual data synthesis pipelines: (1) synthesis based on online plotting code, and (2) synthesis based on chart images in academic papers. Using these pipelines, we create Chart2Code53, a large-scale Chart2code training dataset comprising 53 chart types and 130K chart-code pairs. Experimental results demonstrate that, even with few parameters, a model fine-tuned on Chart2Code53 achieves state-of-the-art performance among open-source models on multiple Chart2code benchmarks.
- Anthology ID:
- 2025.emnlp-main.799
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 15839–15855
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.799/
- Cite (ACL):
- Tianhao Niu, Yiming Cui, Baoxin Wang, Xiao Xu, Xin Yao, Qingfu Zhu, Dayong Wu, Shijin Wang, and Wanxiang Che. 2025. Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15839–15855, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation (Niu et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.799.pdf