LangSAMP: Language-Script Aware Multilingual Pretraining

Yihong Liu, Haotian Ye, Chunlan Ma, Mingyang Wang, Hinrich Schuetze


Abstract
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to individual languages. However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks. Extensive analysis reveals that language and script embeddings capture language- and script-specific nuances, which in turn yields more language-neutral token representations, as evidenced by improved pairwise cosine similarity. In our case study, we also show that language and script embeddings can be used to select better source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
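To make the architectural idea in the abstract concrete, the following is a minimal PyTorch sketch of adding language and script embeddings to the Transformer output before the language modeling head. The module and argument names (LangScriptLMHead, lang_ids, script_ids) and all dimensions are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: sequence-level language and script embeddings are added
# to the final hidden states, so token representations do not have to encode
# this information themselves.
class LangScriptLMHead(nn.Module):
    def __init__(self, hidden_size, vocab_size, num_languages, num_scripts):
        super().__init__()
        self.lang_emb = nn.Embedding(num_languages, hidden_size)
        self.script_emb = nn.Embedding(num_scripts, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, hidden_states, lang_ids, script_ids):
        # hidden_states: (batch, seq_len, hidden) output of the last Transformer block
        # lang_ids, script_ids: (batch,) integer ids for each sequence's language/script
        lang = self.lang_emb(lang_ids).unsqueeze(1)        # (batch, 1, hidden)
        script = self.script_emb(script_ids).unsqueeze(1)  # (batch, 1, hidden)
        enriched = hidden_states + lang + script           # broadcast over positions
        return self.lm_head(enriched)                      # MLM logits over the vocabulary

# Usage sketch with made-up sizes (XLM-R-like hidden size and vocabulary):
head = LangScriptLMHead(hidden_size=768, vocab_size=250002,
                        num_languages=500, num_scripts=30)
hidden = torch.randn(2, 16, 768)  # stand-in for encoder output
logits = head(hidden, torch.tensor([3, 17]), torch.tensor([0, 5]))
```

Because the embeddings are injected only before the prediction head, they can simply be dropped at inference or fine-tuning time, leaving the encoder's token representations unchanged.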
Anthology ID:
2025.acl-long.88
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1743–1770
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.88/
Cite (ACL):
Yihong Liu, Haotian Ye, Chunlan Ma, Mingyang Wang, and Hinrich Schuetze. 2025. LangSAMP: Language-Script Aware Multilingual Pretraining. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1743–1770, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
LangSAMP: Language-Script Aware Multilingual Pretraining (Liu et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.88.pdf