@inproceedings{velasco-roque-2025-scaling,
title = "Scaling, Simplification, and Adaptation: Lessons from Pretraining on Machine-Translated Text",
author = "Velasco, Dan John and
Roque, Matthew Theodore",
editor = "Adelani, David Ifeoluwa and
Arnett, Catherine and
Ataman, Duygu and
Chang, Tyler A. and
Gonen, Hila and
Raja, Rahul and
Schmidt, Fabian and
Stap, David and
Wang, Jiayi",
booktitle = "Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.mrl-main.40/",
pages = "612--630",
ISBN = "979-8-89176-345-6",
abstract = "Most languages lack sufficient data for large-scale monolingual pretraining, creating a ``data wall.'' Multilingual pretraining helps but is limited by language imbalance and the ``curse of multilinguality.'' An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil{---}two typologically distant, lower-resource languages{---}and pretraining GPT-2 models (124M{--}774M) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data."
}