Tokenization as Finite-State Transduction

Marco Cognetta, Naoaki Okazaki


Abstract
Tokenization is the first step in modern neural language model pipelines, in which an input text is converted into a sequence of subword tokens. We introduce, from first principles, a finite-state transduction framework that can encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, are also efficiently representable by simple finite-state transducers. For BPE, this is particularly surprising, given that it does not tokenize strings from left to right and requires a notion of merge priority. We also discuss an application of subword-level pattern matching to guided generation, where the outputs of a language model are constrained to match a specified pattern, and show how tokenization-aware pattern matching offers a theoretical benefit for modeling.
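To make the two schemes concrete, below is a minimal, self-contained Python sketch of MaxMatch and BPE tokenization. This is an illustrative assumption on our part, not the paper's finite-state construction: the vocabulary and merge list in the demo are invented for the example. It highlights the property the abstract calls out, namely that BPE applies the highest-priority merge anywhere in the token sequence rather than scanning left to right.

```python
def maxmatch_tokenize(text, vocab):
    """MaxMatch (WordPiece-style): repeatedly take the longest
    vocabulary item that is a prefix of the remaining input."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            return None  # no vocabulary item matches: tokenization fails
    return tokens


def bpe_tokenize(text, merges):
    """BPE: start from characters and repeatedly apply the
    highest-priority merge rule anywhere in the sequence, so the
    result is not determined by a left-to-right scan."""
    tokens = list(text)
    rank = {pair: r for r, pair in enumerate(merges)}
    while True:
        # find the best-ranked adjacent pair currently present
        best = min(
            (rank[p] for p in zip(tokens, tokens[1:]) if p in rank),
            default=None,
        )
        if best is None:
            return tokens
        a, b = merges[best]
        out, i = [], 0
        while i < len(tokens):  # merge every occurrence of (a, b)
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out


if __name__ == "__main__":
    # hypothetical vocabulary and merge rules, for illustration only
    vocab = {"un", "related", "rel", "ated", "u", "n"}
    print(maxmatch_tokenize("unrelated", vocab))  # ['un', 'related']
    merges = [("a", "t"), ("at", "e"), ("r", "e")]
    print(bpe_tokenize("rate", merges))           # ['r', 'ate']
```

In the BPE example, the merge ("a", "t") fires in the middle of the string before any token containing the leading "r" is formed, which is exactly the non-left-to-right, priority-driven behavior that makes BPE's transducer representation non-obvious.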
Anthology ID: 2025.cl-4.2
Volume: Computational Linguistics, Volume 51, Issue 4 - December 2025
Month: December
Year: 2025
Address: Cambridge, MA
Venue: CL
Publisher: MIT Press
Pages: 1119–1149
URL: https://preview.aclanthology.org/ingest-eacl/2025.cl-4.2/
DOI: 10.1162/coli.a.23
Cite (ACL): Marco Cognetta and Naoaki Okazaki. 2025. Tokenization as Finite-State Transduction. Computational Linguistics, 51(4):1119–1149.
Cite (Informal): Tokenization as Finite-State Transduction (Cognetta & Okazaki, CL 2025)
PDF: https://preview.aclanthology.org/ingest-eacl/2025.cl-4.2.pdf