Scaling Data-Constrained Language Models with Synthetic Data
Hirokazu Kiyomaru, Yusuke Oda, Takashi Kodama, Chaoran Liu, Daisuke Kawahara
Abstract
Large language models (LLMs) improve with more training data, but practical limits on data collection increasingly constrain further scaling. Advances in instruction-following LLMs have enabled controlled, high-quality text generation, making synthetic data a promising remedy. However, its effectiveness for pre-training non-English LLMs remains underexplored. We study this question for Japanese in a fixed token-budget setting in which organic Japanese Web text constitutes only a small share, while far more organic English Web text and instruction-following LLMs capable of generating fluent Japanese are available. We compare three strategies to fill the data shortfall: generating synthetic Japanese text, repeating the limited Japanese Web text, and using English Web text. Experiments show that synthetic Japanese corpora outperform both baselines and approach the performance achieved when the entire token budget is filled with additional organic Japanese Web text.
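The comparison described in the abstract comes down to how a fixed pre-training token budget is filled once the organic Japanese text runs out. Below is a minimal, hypothetical sketch of the three strategies in Python; the budget figures, function name, and source labels are illustrative assumptions, not values or code from the paper.

```python
# Hypothetical sketch of the three budget-filling strategies compared in
# the abstract. All numbers and names are illustrative, not from the paper.

def compose_mixture(budget: int, organic_ja: int, strategy: str) -> dict:
    """Allocate a fixed pre-training token budget when organic Japanese
    text covers only part of it; the shortfall is filled per strategy."""
    shortfall = budget - organic_ja
    if strategy == "synthetic_ja":    # fill with LLM-generated Japanese
        return {"organic_ja": organic_ja, "synthetic_ja": shortfall}
    if strategy == "repeat_ja":       # repeat the limited Japanese data,
        return {"organic_ja": budget}  # i.e. roughly budget/organic_ja epochs
    if strategy == "english_web":     # fill with organic English Web text
        return {"organic_ja": organic_ja, "english_web": shortfall}
    raise ValueError(f"unknown strategy: {strategy}")

if __name__ == "__main__":
    budget, organic_ja = 100_000_000_000, 10_000_000_000  # hypothetical
    for s in ("synthetic_ja", "repeat_ja", "english_web"):
        print(s, compose_mixture(budget, organic_ja, s))
```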
- Anthology ID:
- 2026.findings-eacl.52
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1002–1016
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.52/
- Cite (ACL):
- Hirokazu Kiyomaru, Yusuke Oda, Takashi Kodama, Chaoran Liu, and Daisuke Kawahara. 2026. Scaling Data-Constrained Language Models with Synthetic Data. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1002–1016, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- Scaling Data-Constrained Language Models with Synthetic Data (Kiyomaru et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.52.pdf