Feiyang Kang


2025

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
Feiyang Kang | Newsha Ardalani | Michael Kuchnik | Youssef Emad | Mostafa Elhoushi | Shubhabrata Sengupta | Shang-Wen Li | Ramya Raghavendra | Ruoxi Jia | Carole-Jean Wu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is in limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs, >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we find that pre-training on rephrased synthetic data alone is not faster than pre-training on natural web text, while pre-training on a mixture of 1/3 rephrased synthetic data and 2/3 natural web text can speed up pre-training 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data alone results in notably higher loss on many downstream domains, especially at small data budgets. “Good” ratios of synthetic data in training mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-parameter models. These results contribute mixed evidence on “model collapse” during large-scale single-round (n=1) training on synthetic data: training on rephrased synthetic data shows no performance degradation at foreseeable scales, whereas training on mixtures of textbook-style, purely generated synthetic data shows the patterns predicted by “model collapse”. Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
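The abstract frames its comparisons through fitted scaling laws. The paper's exact functional form is not given here, so the sketch below assumes a common Chinchilla-style saturating power law, L(D) = E + A/D^alpha, fit separately per training mixture, and shows how a data-budget speedup at a fixed target loss could be estimated. The functional form, variable names, and all numbers are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a scaling-law comparison, assuming a Chinchilla-style
# saturating power law L(D) = E + A / D**alpha per training mixture.
# All numbers and names below are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(D, E, A, alpha):
    """Irreducible loss E plus a data-limited term that shrinks with budget D."""
    return E + A / D**alpha

# Hypothetical (data budget in billions of tokens, validation loss) points.
D = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
loss_natural = np.array([3.10, 2.95, 2.80, 2.70, 2.62])   # natural web text
loss_mixture = np.array([3.05, 2.85, 2.68, 2.56, 2.47])   # 1/3 rephrased + 2/3 web

fits = {}
for name, y in (("natural", loss_natural), ("mixture", loss_mixture)):
    popt, _ = curve_fit(loss_curve, D, y, p0=[2.0, 1.0, 0.3])
    fits[name] = popt

def budget_to_reach(target, E, A, alpha):
    """Invert L(D) = target for D; unreachable targets map to infinity."""
    return (A / (target - E)) ** (1.0 / alpha) if target > E else float("inf")

target = 2.6
speedup = budget_to_reach(target, *fits["natural"]) / budget_to_reach(target, *fits["mixture"])
print(f"Estimated data-budget speedup at loss {target}: {speedup:.1f}x")
```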

2024

FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation
Si Chen | Feiyang Kang | Ning Yu | Ruoxi Jia
Findings of the Association for Computational Linguistics: EMNLP 2024

Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query. This limitation often results in suboptimal effectiveness. Moreover, these approaches necessitate the examination of the similarity of individual training points for each query, imposing significant computational demands and creating a substantial barrier for practical applications. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and at the same time clusters the training database towards a reduced extent for LLMs to trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100% improvement in F1 score over the state-of-the-art methods while being x33 faster than TracIn.