Tokenization as Finite-State Transduction

Marco Cognetta, Naoaki Okazaki


Abstract
Tokenization is the first step in modern neural language model pipelines, in which an input text is converted into a sequence of subword tokens. We introduce, from first principles, a finite-state transduction framework that can encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, are also efficiently representable by simple finite-state transducers. For BPE, this is particularly surprising, given that it does not tokenize strings from left to right and requires a notion of merge priority. We also discuss an application of subword-level pattern matching to guided generation, where the outputs of a language model are constrained to match a specified pattern, and show how tokenization-aware pattern matching offers a theoretical benefit for modeling.
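To make the two schemes concrete, below is a minimal, self-contained Python sketch of MaxMatch and BPE tokenization. This is an illustrative assumption on our part, not the paper's finite-state construction: the vocabulary and merge list in the demo are invented for the example. It highlights the property the abstract calls out, namely that BPE applies the highest-priority merge anywhere in the token sequence rather than scanning left to right.

```python
def maxmatch_tokenize(text, vocab):
    """MaxMatch (WordPiece-style): repeatedly take the longest
    vocabulary item that is a prefix of the remaining input."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            return None  # no vocabulary item matches: tokenization fails
    return tokens


def bpe_tokenize(text, merges):
    """BPE: start from characters and repeatedly apply the
    highest-priority merge rule anywhere in the sequence, so the
    result is not determined by a left-to-right scan."""
    tokens = list(text)
    rank = {pair: r for r, pair in enumerate(merges)}
    while True:
        # find the best-ranked adjacent pair currently present
        best = min(
            (rank[p] for p in zip(tokens, tokens[1:]) if p in rank),
            default=None,
        )
        if best is None:
            return tokens
        a, b = merges[best]
        out, i = [], 0
        while i < len(tokens):  # merge every occurrence of (a, b)
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out


if __name__ == "__main__":
    # hypothetical vocabulary and merge rules, for illustration only
    vocab = {"un", "related", "rel", "ated", "u", "n"}
    print(maxmatch_tokenize("unrelated", vocab))  # ['un', 'related']
    merges = [("a", "t"), ("at", "e"), ("r", "e")]
    print(bpe_tokenize("rate", merges))           # ['r', 'ate']
```

In the BPE example, the merge ("a", "t") fires in the middle of the string before any token containing the leading "r" is formed, which is exactly the non-left-to-right, priority-driven behavior that makes BPE's transducer representation non-obvious.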
Anthology ID: 2025.cl-4.2
Volume: Computational Linguistics, Volume 51, Issue 4 - December 2025
Month: December
Year: 2025
Address: Cambridge, MA
Venue: CL
Publisher: MIT Press
Pages: 1119–1149
URL: https://preview.aclanthology.org/ingest-eacl/2025.cl-4.2/
DOI: 10.1162/coli.a.23
Cite (ACL): Marco Cognetta and Naoaki Okazaki. 2025. Tokenization as Finite-State Transduction. Computational Linguistics, 51(4):1119–1149.
Cite (Informal): Tokenization as Finite-State Transduction (Cognetta & Okazaki, CL 2025)
PDF: https://preview.aclanthology.org/ingest-eacl/2025.cl-4.2.pdf