On the Proper Treatment of Units in Surprisal Theory
Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, Ryan Cotterell
Abstract
Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
- Anthology ID:
- 2026.acl-long.1485
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 32202–32224
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1485/
- Cite (ACL):
- Samuel Kiegeland, Vésteinn Snæbjarnarson, Tim Vieira, and Ryan Cotterell. 2026. On the Proper Treatment of Units in Surprisal Theory. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32202–32224, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- On the Proper Treatment of Units in Surprisal Theory (Kiegeland et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.1485.pdf