Synthetic Data in the Era of Large Language Models
Vijay Viswanathan, Xiang Yue, Alisa Liu, Yizhong Wang, Graham Neubig
Abstract
Progress in natural language processing has historically been driven by better data, and researchers today are increasingly using ‘synthetic data’ - data generated with the assistance of large language models - to make dataset construction faster and cheaper. However, most synthetic data generation approaches are executed in an ad hoc manner and ‘reinvent the wheel’ rather than build on prior foundations. This tutorial seeks to build a shared understanding of recent progress in synthetic data generation from NLP and related fields by grouping and describing major methods, applications, and open problems. Our tutorial will be divided into four main sections. First, we will describe algorithms for producing high-quality synthetic data. Second, we will describe how synthetic data can be used to advance the general-purpose development and study of language models. Third, we will demonstrate how to customize synthetic data generation to support scenario-specific applications. Finally, we will discuss open questions about the production and use of synthetic data that must be answered to overcome some of their current limitations. Our goal is that by unifying recent advances in this emerging research direction, we can build foundations upon which the community can improve the rigor, understanding, and effectiveness of synthetic data moving forward.
- Anthology ID: 2025.acl-tutorials.7
- Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Yuki Arase, David Jurgens, Fei Xia
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 11–12
- URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-tutorials.7/
- Cite (ACL): Vijay Viswanathan, Xiang Yue, Alisa Liu, Yizhong Wang, and Graham Neubig. 2025. Synthetic Data in the Era of Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts), pages 11–12, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): Synthetic Data in the Era of Large Language Models (Viswanathan et al., ACL 2025)
- PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-tutorials.7.pdf