Word predictability estimates from language models are not robust to tokenizer vocabulary

Kien Nguyen; Suhas Arehalli


Abstract

Much recent work models language processing using measures of predictability estimated from pretrained language models. These models, however, are built primarily as language technologies rather than cognitive models, and they make many design choices that may align poorly with theories of human language processing. We focus on one such choice, the size of the vocabulary learned by a BPE tokenizer, and investigate (1) its effect on the linguistic plausibility of the subword units the model learns, (2) whether vocabulary size substantially influences the surprisal estimates a model generates, and (3) whether those differences in surprisal translate into differences in the quality of downstream reading time predictions. We find that while vocabulary size does not substantially affect the rate of morphologically reasonable tokenizations, it does affect surprisal estimates and reading time predictions from 5-gram, LSTM, and GPT-2 language models. Moreover, these differences primarily affect words that are split by the tokenizer, suggesting that psycholinguists should design stimuli meant for computational modeling with subword tokenization in mind.
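As a rough illustration of why split words are the sensitive case, the sketch below assumes the common convention that a word's surprisal is the sum of the surprisals of its subword tokens. The abstract does not state how word-level surprisal is aggregated, so treat this convention, the helper name `word_surprisals`, and all probabilities as hypothetical.

```python
import math

def word_surprisals(token_probs, word_ids):
    """Sum per-token surprisals (in bits) into per-word surprisals.

    token_probs: conditional probability of each subword token given its context
    word_ids:    index of the word each token belongs to
    """
    totals = {}
    for p, w in zip(token_probs, word_ids):
        # Surprisal of one token is -log2 p; tokens of the same word add up.
        totals[w] = totals.get(w, 0.0) - math.log2(p)
    return [totals[w] for w in sorted(totals)]

# Hypothetical probabilities, chosen only for illustration.
# Vocabulary A keeps the second word as one token; vocabulary B splits it in three.
whole = word_surprisals([0.5, 0.125], [0, 1])
split = word_surprisals([0.5, 0.5, 0.5, 0.25], [0, 1, 1, 1])
```

In this toy case the unsplit first word receives the same surprisal under both vocabularies, while the second word's estimate depends on how it is segmented, which is the kind of difference the abstract attributes to tokenizer vocabulary size.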

Anthology ID: 2026.conll-main.3
Volume: Proceedings of the 30th Conference on Computational Natural Language Learning
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Claire Bonial, Yevgeni Berzak
Venues: CoNLL | WS
Publisher: Association for Computational Linguistics
Pages: 34–44
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.3/
Cite (ACL): Kien Nguyen and Suhas Arehalli. 2026. Word predictability estimates from language models are not robust to tokenizer vocabulary. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 34–44, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Word predictability estimates from language models are not robust to tokenizer vocabulary (Nguyen & Arehalli, CoNLL 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.3.pdf
