Language Model Probabilities are Not Calibrated in Numeric Contexts

Charles Lovering, Michael Krumdick, Viet Dac Lai, Varshini Reddy, Seth Ebner, Nilesh Kumar, Rik Koncel-Kedziorski, Chris Tanner


Abstract
Some statements have one well-defined continuation (e.g., “the Eiffel Tower is in [Paris]”), whereas others have a natural distribution over multiple options (e.g., “the weighted coin flip was [Heads/Tails]”). We argue that language model (LM) outputs should capture these natural distributions. Our work specifically tests whether LM output probabilities are calibrated to numeric information within their textual contexts. For example, if the context (the prompt) concerns two equally likely options (e.g., heads or tails for a fair coin), the LM output probabilities should also be equal. Likewise, in a context with nonuniformly likely events (e.g., rolling a pair with two dice), an LM should output proportionate probabilities. However, we find that even in simple settings, the best LMs (1) are poorly calibrated and (2) have systematic biases: artifacts like word identity, word order, and word frequency all impact calibration. For example, `gpt-4o-mini` often picks the first of two options presented in the prompt regardless of the options’ implied likelihoods, whereas `Llama-3.1-8B` picks the second. Models do not allocate probability mass among valid options in a calibrated manner.
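To make the probing setup concrete, here is a minimal sketch (not the authors' released code) of how one can check whether a model's next-token probabilities match a context's implied distribution. The model choice (`gpt2`, a small runnable stand-in for models like `Llama-3.1-8B`), the prompt wording, and the first-subword scoring rule are all illustrative assumptions; the paper evaluates many models and richer numeric contexts (e.g., the 1/6 chance of rolling a pair with two dice).

```python
# Minimal calibration probe: does the model split probability mass ~50/50
# between "Heads" and "Tails" after a prompt describing a fair coin?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper studies models like Llama-3.1-8B
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "A fair coin was flipped. It landed on"  # illustrative, not the paper's prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_logits = model(**inputs).logits[0, -1]  # logits for the next token
next_probs = torch.softmax(next_logits, dim=-1)

# Score each option by its first subword token (a simplification; the leading
# space matters for GPT-2-style tokenizers).
option_ids = {
    opt: tokenizer.encode(" " + opt, add_special_tokens=False)[0]
    for opt in ("Heads", "Tails")
}
raw = {opt: next_probs[tid].item() for opt, tid in option_ids.items()}

# Renormalize over the two valid options and compare to the implied 50/50.
total = sum(raw.values())
for opt, p in raw.items():
    print(f"P({opt}) = {p / total:.3f}  (implied: 0.500)")
```

A calibrated model would print values near 0.500 for both options; the paper's finding is that the mass models actually assign instead skews with option order, identity, and frequency.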
Anthology ID: 2025.acl-long.1417
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 29218–29257
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1417/
Cite (ACL): Charles Lovering, Michael Krumdick, Viet Dac Lai, Varshini Reddy, Seth Ebner, Nilesh Kumar, Rik Koncel-Kedziorski, and Chris Tanner. 2025. Language Model Probabilities are Not Calibrated in Numeric Contexts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29218–29257, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Language Model Probabilities are Not Calibrated in Numeric Contexts (Lovering et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1417.pdf