Does Training on Synthetic Data Make Models Less Robust?

Lingze Zhang, Ellie Pavlick


Abstract
An increasingly common practice is to train large language models (LLMs) on synthetic data. Often this synthetic data is produced by the same LLMs it is used to train, or by similar ones. This raises the question of whether synthetic data might exacerbate certain “blindspots” by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our “blindspot” task. Our goal is to determine whether performance disparities between the general and blindspot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, while fine-tuning with synthetic data does not necessarily reduce the use of the heuristic, it also does not worsen it, contrary to our hypothesis.
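The disparity between the general and blindspot tasks described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names (`collapse_to_binary`, `accuracy`, `robustness_gap`) and the toy labels are hypothetical, and the sketch assumes that three-way NLI predictions are collapsed to HANS's two-way labels (entailment vs. non-entailment) before scoring.

```python
from typing import Sequence


def collapse_to_binary(label: str) -> str:
    """Map a three-way NLI label onto a two-way scheme.

    Assumption: "neutral" and "contradiction" both count as "non-entailment".
    """
    return "entailment" if label == "entailment" else "non-entailment"


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def robustness_gap(
    general_preds: Sequence[str],
    general_gold: Sequence[str],
    blindspot_preds: Sequence[str],
    blindspot_gold: Sequence[str],
) -> float:
    """General-task accuracy minus blindspot-task accuracy.

    A gap that grows after fine-tuning would suggest the blindspot is being
    reinforced; a stable gap would suggest it is not.
    """
    general_acc = accuracy(general_preds, general_gold)
    blindspot_acc = accuracy(
        [collapse_to_binary(p) for p in blindspot_preds], blindspot_gold
    )
    return general_acc - blindspot_acc


# Toy labels for illustration only; these are not results from the paper.
toy_general_preds = ["entailment", "neutral", "contradiction", "neutral"]
toy_general_gold = ["entailment", "neutral", "contradiction", "entailment"]
toy_blindspot_preds = ["entailment", "entailment", "neutral", "contradiction"]
toy_blindspot_gold = ["entailment", "non-entailment", "non-entailment", "entailment"]

toy_gap = robustness_gap(
    toy_general_preds, toy_general_gold, toy_blindspot_preds, toy_blindspot_gold
)
```

Comparing this gap before and after fine-tuning on synthetic data is one simple way to frame the question the abstract poses; the paper itself may report the two accuracies in a different form.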
Anthology ID:
2025.insights-1.8
Volume:
The Sixth Workshop on Insights from Negative Results in NLP
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Aleksandr Drozd, João Sedoc, Shabnam Tafreshi, Arjun Akula, Raphael Shu
Venues:
insights | WS
Publisher:
Association for Computational Linguistics
Pages:
79–85
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.insights-1.8/
Cite (ACL):
Lingze Zhang and Ellie Pavlick. 2025. Does Training on Synthetic Data Make Models Less Robust? In The Sixth Workshop on Insights from Negative Results in NLP, pages 79–85, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Does Training on Synthetic Data Make Models Less Robust? (Zhang & Pavlick, insights 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.insights-1.8.pdf