Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, David Ifeoluwa Adelani, Yihong Chen, Raphael Tang, Pontus Stenetorp
Abstract
English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same cannot be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that documents machine-translated from a high-quality English corpus can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web corpus, into nine languages, resulting in a 1.7-trillion-token corpus, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this corpus. Across non-English understanding and reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding fewer than 5% of TransWebLLM’s training tokens as domain-specific data for continued pretraining yields state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh for understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus and models under Open Source Initiative-approved licenses.
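Below is a minimal sketch of the corpus-construction step the abstract describes: streaming English documents from FineWeb-Edu and machine-translating them into a target language. The translation model (NLLB-200), target language, document limit, and paragraph-level chunking are illustrative assumptions, not the paper's actual setup, which the abstract does not specify.

```python
# Hypothetical sketch: build a machine-translated pretraining corpus from an
# English web corpus. Model, language, and chunking choices are illustrative.
from datasets import load_dataset
from transformers import pipeline

# Stream a small public slice of FineWeb-Edu (the English source corpus).
source = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

# NLLB-200 is used here only as an example open MT model; "swh_Latn" is Swahili,
# one of the languages the abstract mentions.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="swh_Latn",
)

translated_docs = []
for i, doc in enumerate(source):
    if i >= 100:  # translate only a handful of documents in this sketch
        break
    # Web documents exceed the MT context window, so translate paragraph by paragraph.
    paragraphs = [p for p in doc["text"].split("\n") if p.strip()]
    outputs = translator(paragraphs, max_length=512)
    translated_docs.append("\n".join(o["translation_text"] for o in outputs))
```

At scale this per-document loop would be replaced by batched, distributed translation over the full corpus, but the structure (source documents in, parallel translated documents out) is the same.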
- Anthology ID: 2025.emnlp-main.1426
- Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 28075–28095
- URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1426/
- Cite (ACL): Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, David Ifeoluwa Adelani, Yihong Chen, Raphael Tang, and Pontus Stenetorp. 2025. Multilingual Language Model Pretraining using Machine-translated Data. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28075–28095, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Multilingual Language Model Pretraining using Machine-translated Data (Wang et al., EMNLP 2025)
- PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1426.pdf