Scaling Data-Constrained Language Models with Synthetic Data
Hirokazu Kiyomaru, Yusuke Oda, Takashi Kodama, Chaoran Liu, Daisuke Kawahara
Abstract
Large language models (LLMs) improve with more training data, but practical limits on data collection increasingly constrain further scaling. Advances in instruction-following LLMs have enabled controlled, high-quality text generation, making synthetic data a promising remedy. However, its effectiveness for pre-training non-English LLMs remains underexplored. We study this question for Japanese in a fixed token-budget setting in which organic Japanese Web text constitutes only a small share, while far more organic English Web text and instruction-following LLMs capable of generating fluent Japanese are available. We compare three strategies to fill the data shortfall: generating synthetic Japanese text, repeating the limited Japanese Web text, and using English Web text. Experiments show that synthetic Japanese corpora outperform both baselines and approach the performance achieved when the entire token budget is filled with additional organic Japanese Web text.
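The comparison described in the abstract comes down to how a fixed pre-training token budget is filled once the organic Japanese text runs out. Below is a minimal, hypothetical sketch of the three strategies in Python; the budget figures, function name, and source labels are illustrative assumptions, not values or code from the paper.

```python
# Hypothetical sketch of the three budget-filling strategies compared in
# the abstract. All numbers and names are illustrative, not from the paper.

def compose_mixture(budget: int, organic_ja: int, strategy: str) -> dict:
    """Allocate a fixed pre-training token budget when organic Japanese
    text covers only part of it; the shortfall is filled per strategy."""
    shortfall = budget - organic_ja
    if strategy == "synthetic_ja":    # fill with LLM-generated Japanese
        return {"organic_ja": organic_ja, "synthetic_ja": shortfall}
    if strategy == "repeat_ja":       # repeat the limited Japanese data,
        return {"organic_ja": budget}  # i.e. roughly budget/organic_ja epochs
    if strategy == "english_web":     # fill with organic English Web text
        return {"organic_ja": organic_ja, "english_web": shortfall}
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    budget, organic_ja = 100_000_000_000, 10_000_000_000  # hypothetical
    for s in ("synthetic_ja", "repeat_ja", "english_web"):
        print(s, compose_mixture(budget, organic_ja, s))
```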
- Anthology ID:
- 2026.findings-eacl.52
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1002–1016
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.52/
- Cite (ACL):
- Hirokazu Kiyomaru, Yusuke Oda, Takashi Kodama, Chaoran Liu, and Daisuke Kawahara. 2026. Scaling Data-Constrained Language Models with Synthetic Data. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1002–1016, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Scaling Data-Constrained Language Models with Synthetic Data (Kiyomaru et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.52.pdf