Self-Vocabularizing Training for Neural Machine Translation

Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra


Abstract
Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training, where each iteration generates a labeled dataset by pairing source sentences with the model's predictions to define a new vocabulary. Building on these insights, we propose *self-vocabularizing training*, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6–8% reduction in vocabulary size.
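The abstract describes an iterative loop: learn a BPE vocabulary, train a translation model, decode the training sources, then re-learn the vocabulary from the resulting pseudo-parallel data and retrain. The sketch below is a minimal, illustrative rendering of that loop under stated assumptions, not the authors' implementation: `learn_bpe_vocab` is a toy re-implementation of standard BPE merging, and `train_model` is a hypothetical stand-in that simply echoes the reference targets in place of a real NMT training pipeline.

```python
# Minimal sketch of a self-vocabularizing training loop (illustrative only).
from collections import Counter

def learn_bpe_vocab(corpus, num_merges=100):
    """Learn a BPE merge table and symbol vocabulary from sentences (toy version)."""
    # Represent each word as a tuple of characters plus an end-of-word marker.
    word_freqs = Counter(tuple(w) + ("</w>",) for s in corpus for w in s.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word in the frequency table.
        merged = {}
        for word, freq in word_freqs.items():
            new_word, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] = merged.get(tuple(new_word), 0) + freq
        word_freqs = merged
    vocab = {sym for word in word_freqs for sym in word}
    return merges, vocab

def train_model(src_sentences, tgt_sentences, vocab):
    """Hypothetical stand-in for NMT training; `vocab` is ignored here.
    Returns a 'model' that echoes the reference translation for a seen source."""
    return lambda src: tgt_sentences[src_sentences.index(src)]

def self_vocabularizing_training(src, tgt, rounds=3, num_merges=100):
    """Iteratively re-learn the BPE vocabulary from the model's own predictions."""
    _, vocab = learn_bpe_vocab(src + tgt, num_merges)      # initial vocabulary
    for _ in range(rounds):
        model = train_model(src, tgt, vocab)               # retrain with current vocab
        predictions = [model(s) for s in src]              # pseudo-labels for sources
        _, vocab = learn_bpe_vocab(src + predictions, num_merges)  # induced vocabulary
    return vocab

if __name__ == "__main__":
    src = ["das ist ein test", "guten morgen"]
    tgt = ["this is a test", "good morning"]
    print(sorted(self_vocabularizing_training(src, tgt, rounds=2, num_merges=20))[:10])
```

In a real system, `train_model` would fit an NMT model on the BPE-segmented pairs and decoding would produce genuinely model-generated targets, so the induced vocabulary could shrink toward the subset of tokens the model actually uses, as the paper reports.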
Anthology ID:
2025.naacl-srw.16
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month:
April
Year:
2025
Address:
Albuquerque, USA
Editors:
Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
Venues:
NAACL | WS
Publisher:
Association for Computational Linguistics
Pages:
171–177
URL:
https://preview.aclanthology.org/moar-dois/2025.naacl-srw.16/
DOI:
10.18653/v1/2025.naacl-srw.16
Cite (ACL):
Pin-Jie Lin, Ernie Chang, Yangyang Shi, and Vikas Chandra. 2025. Self-Vocabularizing Training for Neural Machine Translation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 171–177, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
Self-Vocabularizing Training for Neural Machine Translation (Lin et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.naacl-srw.16.pdf