Training Bilingual LMs with Data Constraints in the Targeted Language

Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier


Abstract
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high-quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language that lacks sufficient pretraining data for training a high-performing language model, by enlisting data from an auxiliary language for which high-quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language and training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target language, and proposing new methods for upsampling data from the auxiliary language. Our results show that, for close languages, stronger auxiliary datasets yield performance gains without modification to the model or training objective and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
Anthology ID:
2025.findings-acl.977
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
Publisher:
Association for Computational Linguistics
Pages:
19096–19122
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.977/
Cite (ACL):
Skyler Seto, Maartje Ter Hoeve, Richard He Bai, Natalie Schluter, and David Grangier. 2025. Training Bilingual LMs with Data Constraints in the Targeted Language. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19096–19122, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Training Bilingual LMs with Data Constraints in the Targeted Language (Seto et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.977.pdf