On the Proper Treatment of Units in Surprisal Theory

Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, Ryan Cotterell


Abstract
Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
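To make the mismatch between units and tokens concrete, the following is a minimal sketch of one common ad hoc procedure of the kind the abstract refers to: summing token-level surprisals within each word. It is an illustration only, not the framework proposed in the paper. The function name `word_surprisals`, the token probabilities, and the token-to-word alignment are all hypothetical.

```python
import math


def word_surprisals(token_probs, word_ids):
    """Sum token surprisals (in bits) within each word.

    token_probs: p(token | preceding tokens) for each token, in order
    word_ids:    for each token, the index of the word it belongs to
    """
    totals = [0.0] * (max(word_ids) + 1)
    for p, w in zip(token_probs, word_ids):
        totals[w] += -math.log2(p)  # surprisal of one token, in bits
    return totals


# Hypothetical stimulus of two words; the first is split into three tokens.
token_probs = [0.5, 0.25, 0.5, 0.125]  # made-up model probabilities
word_ids = [0, 0, 0, 1]                # token -> word alignment (assumed)

print(word_surprisals(token_probs, word_ids))
```

With these made-up numbers the first word receives 1 + 2 + 1 = 4 bits and the second 3 bits. The result depends on the alignment in `word_ids`, which is exactly the kind of implicit choice the abstract argues should be made explicit.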
Anthology ID:
2026.acl-long.1485
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
32202–32224
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1485/
Cite (ACL):
Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, and Ryan Cotterell. 2026. On the Proper Treatment of Units in Surprisal Theory. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32202–32224, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
On the Proper Treatment of Units in Surprisal Theory (Kiegeland et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1485.pdf
Checklist:
 2026.acl-long.1485.checklist.pdf