Should you marginalize over possible tokenizations?

Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman

Abstract
Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
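To make the abstract's point concrete, below is a minimal, hypothetical sketch of what "marginalizing over tokenizations" means: it exactly enumerates every segmentation of a short string under a toy unigram token model and sums their probabilities, and compares that marginal to the probability of a single greedy tokenization (the "default" practice). The vocabulary, probabilities, and greedy tokenizer are placeholders for illustration only; they are not the paper's models or its importance-sampling algorithm, which is needed precisely because this exact enumeration grows exponentially with string length.

```python
import math

# Toy unigram token model: log-probabilities for a tiny, made-up vocabulary.
# A real autoregressive LM would score tokens conditionally, not independently.
VOCAB_LOGP = {
    "un": math.log(0.10), "like": math.log(0.10), "ly": math.log(0.10),
    "unlike": math.log(0.05), "likely": math.log(0.05),
    "u": math.log(0.02), "n": math.log(0.02), "l": math.log(0.02),
    "i": math.log(0.02), "k": math.log(0.02), "e": math.log(0.02), "y": math.log(0.02),
}

def score(tokens):
    """Log-probability of a token sequence under the toy unigram model."""
    return sum(VOCAB_LOGP[t] for t in tokens)

def segmentations(s):
    """Yield every way of splitting s into vocabulary tokens."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in VOCAB_LOGP:
            for rest in segmentations(s[i:]):
                yield [head] + rest

def default_tokenization(s):
    """Greedy longest-match tokenization, standing in for a standard tokenizer."""
    tokens = []
    while s:
        for i in range(len(s), 0, -1):
            if s[:i] in VOCAB_LOGP:
                tokens.append(s[:i])
                s = s[i:]
                break
        else:
            raise ValueError("string not coverable by the toy vocabulary")
    return tokens

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

if __name__ == "__main__":
    text = "unlikely"
    default_lp = score(default_tokenization(text))                     # usual practice
    marginal_lp = logsumexp([score(t) for t in segmentations(text)])   # exact marginal
    print(f"default log p : {default_lp:.4f}")
    print(f"marginal log p: {marginal_lp:.4f}")  # >= default, since it sums over more tokenizations
```

The marginal log-probability is always at least as large as the default one, since the default tokenization contributes just one term of the sum; the paper's question is how large that gap is in practice for real models, where the sum must be estimated (e.g., by importance sampling) rather than enumerated.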
Anthology ID:
2023.acl-short.1
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1–12
URL:
https://aclanthology.org/2023.acl-short.1
DOI:
10.18653/v1/2023.acl-short.1
Cite (ACL):
Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, and Marc Dymetman. 2023. Should you marginalize over possible tokenizations?. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–12, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Should you marginalize over possible tokenizations? (Chirkova et al., ACL 2023)
PDF:
https://preview.aclanthology.org/teach-a-man-to-fish/2023.acl-short.1.pdf
Video:
https://preview.aclanthology.org/teach-a-man-to-fish/2023.acl-short.1.mp4