@inproceedings{jiang-etal-2026-tokenizer,
    title = "Tokenizer-Aware Cross-Lingual Adaptation of Decoder-Only {LLM}s through Embedding Relearning and Swapping",
    author = "Jiang, Fan and
      Yu, Honglin and
      Chung, Grace Y and
      Cohn, Trevor",
    editor = "Demberg, Vera and
      Inui, Kentaro and
      Marquez, Llu{\'i}s",
    booktitle = "Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.357/",
    pages = "7606--7636",
    isbn = "979-8-89176-380-7",
    abstract = "Extending Large Language Models (LLMs) to new languages is challenging, with most methods proposed suffering from high computational cost and catastrophic forgetting of original model capabilities. Embedding relearning{~}(CITATION), a technique that creates new tokenizers and tunes embeddings on fixed model weights for target language adaptation, is both light-weight and performant. However, it has only been shown to work for older generation encoder-only models and for high resource languages. In this paper, we extend this framework to decoder-only LLMs focusing on joint adaptation to many languages, including low-resource ones. We experiment in three language groups over 100 languages each. We adapt a pre-trained LLM via switching to a customized tokenizer, and relearning the embedding layer. Across 96 diverse languages spanning both classification and generation tasks, we show embedding relearning improves models by up to 20{\%}, being highly competitive with full-weight updating baselines while vastly more computationally efficient and mitigating catastrophic forgetting. This translates into better results in transferring the improved multilingual performance to tasks that build on core English abilities (e.g., multilingual math reasoning), compared to various baselines. Further analysis reveals the critical role of customizing tokenizers in achieving effective language transfer, particularly for non-Latin script languages."
}
Markdown (Informal)
[Tokenizer-Aware Cross-Lingual Adaptation of Decoder-Only LLMs through Embedding Relearning and Swapping](https://preview.aclanthology.org/ingest-eacl/2026.eacl-long.357/) (Jiang et al., EACL 2026)
ACL