EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization

Pranaydeep Singh, Eneko Agirre, Gorka Azkune, Orphee De Clercq, Els Lefever


Abstract
Continual pre-training has long been considered the default strategy for adapting models to non-English languages, but it struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach utilizes GIZA++ to construct a subword-level alignment matrix between source (English) and target language tokens. This matrix enables informed initialization of target tokenizer embeddings, which provides a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic and Korean. Experimental results on key NLP tasks – including POS tagging, Sentiment Analysis, NLI, and NER – demonstrate that EnerGIZAr achieves superior monolingual performance while also outperforming all methods for cross-lingual transfer when tested on XNLI. With EnerGIZAr, we propose an intuitive, explainable, and state-of-the-art initialization technique for continual pre-training of English models.
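To illustrate the core idea described in the abstract, the following is a minimal sketch (not the authors' released code) of alignment-based embedding initialization: given a GIZA++-derived count matrix between target and source subwords, each new target embedding is set to the count-weighted average of its aligned source embeddings. All names, shapes, and the random fallback for unaligned subwords are illustrative assumptions.

import numpy as np

def init_target_embeddings(align_counts: np.ndarray,
                           source_embeddings: np.ndarray,
                           fallback_std: float = 0.02) -> np.ndarray:
    """align_counts: (V_tgt, V_src) subword co-alignment counts from GIZA++.
    source_embeddings: (V_src, d) pretrained English embedding matrix.
    Returns a (V_tgt, d) embedding matrix for the target tokenizer."""
    v_tgt, _ = align_counts.shape
    d = source_embeddings.shape[1]
    target = np.empty((v_tgt, d), dtype=source_embeddings.dtype)
    row_sums = align_counts.sum(axis=1)
    for i in range(v_tgt):
        if row_sums[i] > 0:
            # Normalize alignment counts to weights and mix the aligned source vectors.
            weights = align_counts[i] / row_sums[i]
            target[i] = weights @ source_embeddings
        else:
            # Unaligned target subwords fall back to a small random initialization (an assumption).
            target[i] = np.random.normal(0.0, fallback_std, size=d)
    return target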
Anthology ID:
2025.findings-acl.109
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2124–2137
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.109/
Cite (ACL):
Pranaydeep Singh, Eneko Agirre, Gorka Azkune, Orphee De Clercq, and Els Lefever. 2025. EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2124–2137, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization (Singh et al., Findings 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.findings-acl.109.pdf