David A. Haslett
2025
Tokenization Changes Meaning in Large Language Models: Evidence from Chinese
David A. Haslett
Computational Linguistics, Volume 51, Issue 3 - September 2025
Large language models segment many words into multiple tokens, and there is mixed evidence as to whether tokenization affects how state-of-the-art models represent meanings. Chinese characters present an opportunity to investigate this issue: They contain semantic radicals, which often convey useful information; characters with the same semantic radical tend to begin with the same one or two bytes (when using UTF-8 encodings); and tokens are common strings of bytes, so characters with the same radical often begin with the same token. This study asked GPT-4, GPT-4o, and Llama 3 whether characters contain the same semantic radical, elicited semantic similarity ratings, and conducted odd-one-out tasks (i.e., which character is not like the others). In all cases, misalignment between tokens and radicals systematically corrupted representations of Chinese characters. In experiments comparing characters represented by single tokens to multi-token characters, the models were less accurate for single-token characters, which suggests that segmenting words into fewer, longer tokens obscures valuable information in word form and will not resolve the problems introduced by tokenization. In experiments with 12 European languages, misalignment between tokens and suffixes systematically corrupted categorization of words by all three models, which suggests that the tendency to treat malformed tokens like linguistic units is pervasive.
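The byte-level pattern the abstract describes can be checked directly with the standard library. A minimal sketch (the character choices are illustrative, not taken from the paper's stimuli): characters with the water radical 氵 sit close together in the Unicode CJK block, so their three-byte UTF-8 encodings begin with the same two bytes, while a character with a different radical does not share that prefix.

```python
# Water-radical characters cluster in the Unicode CJK block, so their
# UTF-8 encodings start with the same two bytes.
water = ["江", "池", "汗"]   # river, pond, sweat: all water-radical 氵
other = "树"                 # tree: a different radical

prefixes = {ch.encode("utf-8")[:2] for ch in water}
print(prefixes)                           # one shared two-byte prefix
print(other.encode("utf-8")[:2] in prefixes)   # the tree character differs
```

Because byte-pair tokenizers merge common byte sequences, such a shared prefix can become a shared leading token, which is what lets the study align (or misalign) tokens with semantic radicals.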
Human-likeness of LLMs in the Mental Lexicon
Bei Xiao | Xufeng Duan | David A. Haslett | Zhenguang Cai
Proceedings of the 29th Conference on Computational Natural Language Learning
Recent research has increasingly focused on the extent to which large language models (LLMs) exhibit human-like behavior. In this study, we investigate whether the mental lexicon in LLMs resembles that of humans in terms of lexical organization. Using a word association task—a direct and widely used method for probing word meaning and relationships in the human mind—we evaluated the lexical representations of GPT-4 and Llama-3.1. Our findings reveal that LLMs closely emulate human mental lexicons in capturing semantic relatedness but exhibit notable differences in other properties, such as association frequency and dominant lexical patterns (e.g., top associates). Specifically, LLM lexicons demonstrate greater clustering and reduced diversity compared to the human lexicon, with KL divergence analysis confirming significant deviations in word association patterns. Additionally, LLMs fail to fully capture word association response patterns across different human demographic groups. Among the models, GPT-4 consistently exhibited a slightly higher degree of human-likeness than Llama-3.1. This study highlights both the potential and limitations of LLMs in replicating human mental lexicons, offering valuable insights for applications in natural language processing and cognitive science research involving LLMs.
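The KL divergence analysis mentioned above compares a model's distribution of association responses for a cue word against the human distribution. A minimal sketch with made-up numbers (the cue, responses, and probabilities are hypothetical, not the paper's data):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) over a shared support; assumes q[w] > 0 wherever p[w] > 0."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

# Toy association distributions for the cue "doctor" (illustrative numbers).
human = {"nurse": 0.5, "hospital": 0.3, "patient": 0.2}
model = {"nurse": 0.7, "hospital": 0.2, "patient": 0.1}

print(kl_divergence(human, model))   # > 0: the model's responses deviate
print(kl_divergence(human, human))   # 0.0: identical distributions
```

A divergence of zero means the two distributions match exactly; larger values indicate the kind of deviation the study reports, e.g., probability mass concentrated on fewer top associates.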
How Much Semantic Information is Available in Large Language Model Tokens?
David A. Haslett | Zhenguang G. Cai
Transactions of the Association for Computational Linguistics, Volume 13
Large language models segment many words into multiple tokens, and companies that make those models claim that meaningful subword tokens are essential. To investigate whether subword tokens bear meaning, we segmented tens of thousands of words from each of 41 languages according to three generations of GPT tokenizers. We found that words sharing tokens are more semantically similar than expected by chance or expected from length alone, that tokens capture morphological information even when they don’t look like morphemes, and that tokens capture more information than is explained by morphology. In languages that use a script other than the Latin alphabet, GPT-4 tokens are uninformative, but GPT-4o has improved this situation. These results suggest that comparing tokens to morphemes overlooks the wider variety of semantic information available in word form and that standard tokenization methods successfully capture much of that information.