FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Abraham Toluwase Owodunni; Orevaoghene Ahia; Sachin Kumar

Abstract

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of text in out-of-distribution domains and in unseen languages or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves improvements of up to 10 percentage points on token classification and generative tasks compared to BPE and other gradient-based tokenizer baselines. We validate our findings using models of varying sizes, and our method demonstrates consistent improvements across scales.
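
To make the idea of boundary-based segmentation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a boundary predictor has already produced one binary decision per byte; the function names `segment_bytes` and `compression_rate`, the convention that a 1 marks the last byte of a segment, and the definition of compression rate as bytes per segment are all illustrative assumptions.

```python
from typing import List


def segment_bytes(text: str, boundaries: List[int]) -> List[bytes]:
    """Split the UTF-8 bytes of `text` into variable-length segments.

    Assumed convention (hypothetical): boundaries[i] == 1 means that a
    segment ends after byte i.
    """
    data = text.encode("utf-8")
    if len(boundaries) != len(data):
        raise ValueError("need exactly one boundary decision per byte")
    segments: List[bytes] = []
    start = 0
    for i, is_end in enumerate(boundaries):
        if is_end:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        # Bytes after the last predicted boundary form a final segment.
        segments.append(data[start:])
    return segments


def compression_rate(num_bytes: int, num_segments: int) -> float:
    """Average number of bytes per segment (assumed definition)."""
    return num_bytes / num_segments


# Hand-written boundary decisions standing in for a learned predictor:
# one segment ends after the space (byte 5), one at the last byte (byte 11).
example_text = "flexi tokens"
example_boundaries = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
example_segments = segment_bytes(example_text, example_boundaries)
example_rate = compression_rate(len(example_text.encode("utf-8")), len(example_segments))
```

Under these assumptions, an auxiliary loss that enforces a fixed compression rate would push `example_rate` toward a single target value for all inputs, whereas a more flexible objective would let it vary with the language, script, or domain.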

Anthology ID: 2026.findings-acl.848
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 17170–17190
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.848/
Cite (ACL): Abraham Toluwase Owodunni, Orevaoghene Ahia, and Sachin Kumar. 2026. FLEXITOKENS: Flexible Tokenization for Evolving Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 17170–17190, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): FLEXITOKENS: Flexible Tokenization for Evolving Language Models (Owodunni et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.848.pdf
Checklist: 2026.findings-acl.848.checklist.pdf
