EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization

Pranaydeep Singh, Eneko Agirre, Gorka Azkune, Orphee De Clercq, Els Lefever


Abstract
Continual pre-training has long been considered the default strategy for adapting models to non-English languages, but it struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach utilizes GIZA++ to construct a subword-level alignment matrix between source (English) and target language tokens. This matrix enables informed initialization of target tokenizer embeddings, which provides a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic and Korean. Experimental results on key NLP tasks – including POS tagging, Sentiment Analysis, NLI, and NER – demonstrate that EnerGIZAr achieves superior monolingual performance while also outperforming all methods for cross-lingual transfer when tested on XNLI. With EnerGIZAr, we propose an intuitive, explainable, and state-of-the-art initialization technique for continual pre-training of English models.
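To illustrate the core idea described in the abstract, the following is a minimal sketch (not the authors' released code) of alignment-based embedding initialization: given a GIZA++-derived count matrix between target and source subwords, each new target embedding is set to the count-weighted average of its aligned source embeddings. All names, shapes, and the random fallback for unaligned subwords are illustrative assumptions.

import numpy as np

def init_target_embeddings(align_counts: np.ndarray,
                           source_embeddings: np.ndarray,
                           fallback_std: float = 0.02) -> np.ndarray:
    """align_counts: (V_tgt, V_src) subword co-alignment counts from GIZA++.
    source_embeddings: (V_src, d) pretrained English embedding matrix.
    Returns a (V_tgt, d) embedding matrix for the target tokenizer."""
    v_tgt, _ = align_counts.shape
    d = source_embeddings.shape[1]
    target = np.empty((v_tgt, d), dtype=source_embeddings.dtype)
    row_sums = align_counts.sum(axis=1)
    for i in range(v_tgt):
        if row_sums[i] > 0:
            # Normalize alignment counts to weights and mix the aligned source vectors.
            weights = align_counts[i] / row_sums[i]
            target[i] = weights @ source_embeddings
        else:
            # Unaligned target subwords fall back to a small random initialization (an assumption).
            target[i] = np.random.normal(0.0, fallback_std, size=d)
    return target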
Anthology ID:
2025.findings-acl.109
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2124–2137
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.109/
Cite (ACL):
Pranaydeep Singh, Eneko Agirre, Gorka Azkune, Orphee De Clercq, and Els Lefever. 2025. EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2124–2137, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization (Singh et al., Findings 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.109.pdf