EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization
Pranaydeep Singh, Eneko Agirre, Gorka Azkune, Orphee De Clercq, Els Lefever
Abstract
Continual pre-training has long been considered the default strategy for adapting models to non-English languages, but it struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach utilizes GIZA++ to construct a subword-level alignment matrix between source (English) and target language tokens. This matrix enables informed initialization of target tokenizer embeddings, which provides a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic and Korean. Experimental results on key NLP tasks – including POS tagging, Sentiment Analysis, NLI, and NER – demonstrate that EnerGIZAr achieves superior monolingual performance while also outperforming all methods for cross-lingual transfer when tested on XNLI. With EnerGIZAr, we propose an intuitive, explainable, and state-of-the-art initialization technique for continual pre-training of English models.
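To make the initialization idea concrete, below is a minimal sketch (not the authors' released code) of how a GIZA++-style subword alignment count matrix could be used to seed target-language embeddings: each new target subword is set to the alignment-weighted average of the English subword embeddings it aligns to, with an assumed fallback for unaligned tokens. The function and variable names (init_target_embeddings, counts, src_emb) are illustrative assumptions.

```python
# Hypothetical sketch of alignment-based embedding initialization.
# counts[t, s] is assumed to hold GIZA++ alignment counts between
# target subword t and source (English) subword s.
import numpy as np

def init_target_embeddings(counts: np.ndarray, src_emb: np.ndarray,
                           rng: np.random.Generator | None = None) -> np.ndarray:
    """counts: (V_tgt, V_src) alignment count matrix.
    src_emb: (V_src, d) pretrained source embedding matrix.
    Returns a (V_tgt, d) matrix of initial target embeddings."""
    rng = rng or np.random.default_rng(0)
    v_tgt, d = counts.shape[0], src_emb.shape[1]
    tgt_emb = np.empty((v_tgt, d), dtype=src_emb.dtype)
    totals = counts.sum(axis=1)                      # total alignments per target subword
    for t in range(v_tgt):
        if totals[t] > 0:
            weights = counts[t] / totals[t]          # normalized alignment weights
            tgt_emb[t] = weights @ src_emb           # weighted average of aligned source embeddings
        else:
            # unaligned subwords: fall back to a small perturbation of the source mean
            tgt_emb[t] = src_emb.mean(axis=0) + 0.02 * rng.standard_normal(d)
    return tgt_emb
```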
- Anthology ID: 2025.findings-acl.109
- Volume: Findings of the Association for Computational Linguistics: ACL 2025
- Month: July
- Year: 2025
- Address: Vienna, Austria
- Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 2124–2137
- URL: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.109/
- Cite (ACL): Pranaydeep Singh, Eneko Agirre, Gorka Azkune, Orphee De Clercq, and Els Lefever. 2025. EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2124–2137, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal): EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization (Singh et al., Findings 2025)
- PDF: https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.109.pdf