Ted-Tok: Maintaining an Evolving Vocabulary for Lifelong Learning

Jiameng Huang, Zhi Zhang, Zhenyu He, Jiacheng Sun, Di He


Abstract
Lifelong learning investigates how models adapt when exposed to a potentially infinite stream of data. Most conventional approaches focus on updating model parameters (i.e., the neural network weights) as the underlying data distribution evolves over time. In natural language processing, however, model parameters are not the only component that matters. The tokenizer, a foundational part of the system, is usually assumed to remain fixed in lifelong learning scenarios. In this work, we challenge this assumption: as language evolves, a static tokenizer fragments newly emerging lexical items, reducing compression efficiency and consequently degrading model performance. We introduce the Temporal Drift Tokenizer (Ted-Tok), which maintains an evolving vocabulary that adapts to emerging linguistic patterns over time. This adaptivity is driven by two components: time-weighted frequency estimators that smooth short-term fluctuations to capture persistent linguistic trends, and a principled addition-deletion strategy targeting sink tokens. Across multiple domains, Ted-Tok consistently improves compression and task performance, with gains increasing under stronger drift, underscoring the role of tokenizer adaptivity in lifelong learning.
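The abstract does not specify the form of the time-weighted frequency estimators. As a minimal sketch of the general idea only, the Python snippet below assumes an exponential moving average over per-step counts; the class name `DecayedFrequency`, the `decay` parameter, and the update rule are illustrative assumptions, not the authors' method.

```python
from collections import Counter, defaultdict


class DecayedFrequency:
    """Hypothetical time-weighted frequency estimator (exponential moving average).

    Assumption: this is NOT taken from the paper; it only illustrates how
    smoothing over time can favor persistent items over short bursts.
    """

    def __init__(self, decay=0.9):
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must lie strictly between 0 and 1")
        self.decay = decay
        self.score = defaultdict(float)

    def update(self, batch):
        """Fold the counts from one time step into the running scores."""
        counts = Counter(batch)
        # Iterate over a separate set so items absent from this batch still decay.
        for item in set(self.score) | set(counts):
            old = self.score[item]
            self.score[item] = self.decay * old + (1.0 - self.decay) * counts.get(item, 0)

    def top(self, k):
        """Return the k items with the highest smoothed frequency."""
        return sorted(self.score, key=self.score.get, reverse=True)[:k]
```

Under this assumed rule, an item that appears in a single large burst loses weight at every later step, while an item that keeps recurring at a modest rate eventually overtakes it. That matches the behavior the abstract describes as capturing persistent trends rather than short-term fluctuations.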
Anthology ID:
2026.acl-long.394
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8706–8719
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.394/
Cite (ACL):
Jiameng Huang, Zhi Zhang, Zhenyu He, Jiacheng Sun, and Di He. 2026. Ted-Tok: Maintaining an Evolving Vocabulary for Lifelong Learning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8706–8719, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Ted-Tok: Maintaining an Evolving Vocabulary for Lifelong Learning (Huang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.394.pdf
Checklist:
 2026.acl-long.394.checklist.pdf