On the Proper Treatment of Units in Surprisal Theory

Samuel Kiegeland; Vésteinn Snæbjarnarson; Tim Vieira; Ryan Cotterell

Abstract

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
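To make the token/unit mismatch concrete, the following is a minimal sketch of one common ad hoc procedure of the kind the abstract describes: summing token-level surprisals over the tokens that spell each word. It is not the framework proposed in the paper, which the abstract does not spell out. The function name `word_surprisals`, the toy segmentation, and the probability values are all hypothetical and chosen only for illustration.

```python
import math


def word_surprisals(tokens, token_logprobs, words):
    """Sum token-level surprisals (in bits) over the tokens spelling each word.

    Illustrative only: `tokens` must concatenate exactly to the words in
    `words`, and `token_logprobs` holds natural-log probabilities that a
    language model is assumed to have assigned to each token in context.
    """
    results = []
    i = 0
    for word in words:
        spelled, total_bits = "", 0.0
        while spelled != word:
            if i >= len(tokens):
                raise ValueError(f"ran out of tokens while spelling {word!r}")
            spelled += tokens[i]
            if not word.startswith(spelled):
                raise ValueError(f"token {tokens[i]!r} does not align with {word!r}")
            # Convert the natural-log probability to surprisal in bits.
            total_bits += -token_logprobs[i] / math.log(2)
            i += 1
        results.append((word, total_bits))
    return results


# Hypothetical toy input: the probabilities are made up, not model outputs.
tokens = ["un", "believ", "able", "news"]
token_probs = [0.5, 0.25, 0.5, 0.125]
token_logprobs = [math.log(p) for p in token_probs]
words = ["unbelievable", "news"]

result = word_surprisals(tokens, token_logprobs, words)
for word, bits in result:
    print(f"{word}: {bits:.2f} bits")
```

Note that the resulting word-level numbers depend on how the tokenizer happened to split each word, which is one way the choice of unit can end up implicit in the analysis.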

Anthology ID: 2026.acl-long.1485
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 32202–32224
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1485/
Cite (ACL): Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, and Ryan Cotterell. 2026. On the Proper Treatment of Units in Surprisal Theory. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32202–32224, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): On the Proper Treatment of Units in Surprisal Theory (Kiegeland et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1485.pdf
Checklist: 2026.acl-long.1485.checklist.pdf
