How Much Semantic Information is Available in Large Language Model Tokens?

David A. Haslett, Zhenguang G. Cai
Abstract
Large language models segment many words into multiple tokens, and companies that make those models claim that meaningful subword tokens are essential. To investigate whether subword tokens bear meaning, we segmented tens of thousands of words from each of 41 languages according to three generations of GPT tokenizers. We found that words sharing tokens are more semantically similar than expected by chance or expected from length alone, that tokens capture morphological information even when they don’t look like morphemes, and that tokens capture more information than is explained by morphology. In languages that use a script other than the Latin alphabet, GPT-4 tokens are uninformative, but GPT-4o has improved this situation. These results suggest that comparing tokens to morphemes overlooks the wider variety of semantic information available in word form and that standard tokenization methods successfully capture much of that information.
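The subword segmentation the abstract refers to can be illustrated with a minimal sketch. This is not the authors' procedure, and real GPT tokenizers apply byte-pair-encoding merge rules over bytes rather than dictionary lookup; the greedy longest-match segmenter and toy vocabulary below are hypothetical, chosen only to show how morphologically related words can end up sharing subword tokens:

```python
def segment(word, vocab):
    """Split `word` into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Fall back to a single character (real tokenizers
            # fall back to individual bytes instead).
            tokens.append(word[i])
            i += 1
    return tokens

# Hypothetical toy vocabulary: related words share the token "run".
vocab = {"run", "ning", "ner", "walk", "ing", "er"}
print(segment("running", vocab))  # ['run', 'ning']
print(segment("runner", vocab))   # ['run', 'ner']
print(segment("walking", vocab))  # ['walk', 'ing']
```

Under this toy vocabulary, "running" and "runner" share the token "run", which is the kind of form overlap whose semantic informativeness the paper measures across 41 languages.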
Anthology ID:
2025.tacl-1.20
Volume:
Transactions of the Association for Computational Linguistics, Volume 13
Year:
2025
Address:
Cambridge, MA
Venue:
TACL
Publisher:
MIT Press
Pages:
408–423
URL:
https://preview.aclanthology.org/corrections-2025-07/2025.tacl-1.20/
DOI:
10.1162/tacl_a_00747
Cite (ACL):
David A. Haslett and Zhenguang G. Cai. 2025. How Much Semantic Information is Available in Large Language Model Tokens?. Transactions of the Association for Computational Linguistics, 13:408–423.
Cite (Informal):
How Much Semantic Information is Available in Large Language Model Tokens? (Haslett & Cai, TACL 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-07/2025.tacl-1.20.pdf