Which Pieces Does Unigram Tokenization Really Need?

Sander Land; Yuval Pinter

Abstract

The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece package and adapters thereof. We bridge this gap between theory and practice by providing a clear guide to implementation and parameter choices. We also identify a simpler algorithm that accepts slightly higher training loss in exchange for improved compression.

Anthology ID: 2026.findings-acl.316
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6351–6360
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.316/
Cite (ACL): Sander Land and Yuval Pinter. 2026. Which Pieces Does Unigram Tokenization Really Need?. In Findings of the Association for Computational Linguistics: ACL 2026, pages 6351–6360, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Which Pieces Does Unigram Tokenization Really Need? (Land & Pinter, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.316.pdf
Checklist: 2026.findings-acl.316.checklist.pdf
