Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries

Haruki Sakajo, Yusuke Ide, Justin Vasselli, Yusuke Sakai, Yingtao Tian, Hidetaka Kamigaito, Taro Watanabe


Abstract
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages.Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources.In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists.Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords.The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer.The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
Anthology ID:
2025.findings-acl.1333
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venues:
Findings | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
25963–25976
Language:
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1333/
DOI:
Bibkey:
Cite (ACL):
Haruki Sakajo, Yusuke Ide, Justin Vasselli, Yusuke Sakai, Yingtao Tian, Hidetaka Kamigaito, and Taro Watanabe. 2025. Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25963–25976, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries (Sakajo et al., Findings 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.1333.pdf