Maurice Weber




2025

Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang | Yao Lu | Maurice Weber | Max Ryabinin | David Ifeoluwa Adelani | Yihong Chen | Raphael Tang | Pontus Stenetorp
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). However, the same cannot be said for most other languages, likely due to a gap in the quality and diversity of available multilingual pretraining corpora. In this work, we find that documents machine-translated from a high-quality English corpus can contribute significantly to the pretraining quality of multilingual LLMs. Concretely, we translate FineWeb-Edu, a high-quality English web corpus, into nine languages, resulting in a 1.7-trillion-token corpus, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this corpus. Across non-English understanding and reasoning tasks, we show that TransWebLLM matches or even outperforms multilingual LLMs of similar size, including Llama3.2, Qwen2.5, and Gemma3, despite being trained on an order of magnitude less data. Moreover, we show that adding fewer than 5% of TransWebLLM’s training tokens as domain-specific data for continued pretraining yields state-of-the-art results in Arabic, Indonesian, Swahili, and Welsh for understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus and models under Open Source Initiative-approved licenses.
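
The abstract describes translating an English web corpus into multiple target languages before pretraining. The sketch below is purely illustrative and is not the authors' pipeline: the choice of Hugging Face translation checkpoints (Helsinki-NLP opus-mt models), the sample sizes, and the truncation settings are all assumptions made for the example; only the FineWeb-Edu source corpus is named in the abstract.

```python
# Minimal, illustrative sketch of translating FineWeb-Edu samples into a few
# target languages. Model names, batch limits, and field names are assumptions;
# the paper does not specify the MT system used to build TransWebEdu.
from datasets import load_dataset
from transformers import pipeline

# Hypothetical subset of target languages (the paper covers nine in total).
TARGET_LANGS = {
    "de": "Helsinki-NLP/opus-mt-en-de",
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "sw": "Helsinki-NLP/opus-mt-en-sw",
}


def translate_sample(lang: str, model_name: str, n_docs: int = 10):
    """Translate a small sample of FineWeb-Edu documents into one target language."""
    translator = pipeline("translation", model=model_name)
    # Stream the English source corpus so it is never fully materialized locally.
    dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
    translated = []
    for i, doc in enumerate(dataset):
        if i >= n_docs:
            break
        # "text" is the document field in FineWeb-Edu; truncate to fit the model.
        out = translator(doc["text"], truncation=True, max_length=512)
        translated.append({"lang": lang, "text": out[0]["translation_text"]})
    return translated


if __name__ == "__main__":
    for lang, model_name in TARGET_LANGS.items():
        sample = translate_sample(lang, model_name)
        print(lang, len(sample), "documents translated")
```

At corpus scale (1.7 trillion tokens across nine languages), this step would of course be run as a distributed batch job rather than a single-process loop; the snippet only makes the translate-then-pretrain recipe concrete.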