One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Benedikt Ebing, Lennart Keller, Goran Glavaš


Abstract
Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: **(1)** transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or **(2)** between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for *pretraining* general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: **(i)** loss of script-specific information and **(ii)** dilution of language-specific representations from increased subword overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap dilutes language-specific representations. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.
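The abstract equates encoding efficiency with fertility. As a rough illustration only, the sketch below computes fertility under the common definition of average subword tokens per word. The abstract does not give the paper's exact definition, so this definition, the whitespace word splitting (which would not segment Chinese or Japanese text), and the `toy_tokenize` helper are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word.

    Assumption: words are delimited by whitespace, which does not hold
    for unsegmented scripts such as Chinese or Japanese.
    """
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in the input texts")
    return n_tokens / n_words


def toy_tokenize(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a trained subword tokenizer.

    Splits a word into fixed-length character chunks; a real setup would
    call the tokenizer of the model under study instead.
    """
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
```

Lower fertility means fewer tokens are needed to encode the same text. Comparing the value for an original-script corpus against its romanized version, each with its own tokenizer, would give the kind of efficiency comparison the abstract refers to.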
Anthology ID:
2026.findings-acl.1909
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
38291–38307
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1909/
Cite (ACL):
Benedikt Ebing, Lennart Keller, and Goran Glavaš. 2026. One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38291–38307, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models (Ebing et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1909.pdf
Checklist:
 2026.findings-acl.1909.checklist.pdf