Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

James A. Michaelov, Reeka Estacio, Zhien Zhang, Ben Bergen


Abstract
Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that, in contrast with the results of previous work, language models’ ability to do this is far from robust. In fact, under certain conditions, all models tested, including Llama 3, Gemma 2, and Mistral NeMo, perform below chance, assigning higher probabilities to impossible sentences such as ‘the car was given a parking ticket by the brake’ than to merely unlikely sentences such as ‘the car was given a parking ticket by the explorer’.
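
For concreteness, the comparison the abstract describes can be illustrated with a short scoring script. The sketch below is not the authors' evaluation code: the model checkpoint ("gpt2" as a stand-in for models such as Llama 3, Gemma 2, and Mistral NeMo) and the scoring function (summed token log-probabilities under a causal language model) are illustrative assumptions, not the paper's exact metric.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the paper evaluates larger models such as
# Llama 3, Gemma 2, and Mistral NeMo.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the model.
    (One plausible scoring choice; the paper's exact metric may differ.)"""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=ids).logits
    # Shift so the logits at position t predict the token at t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

impossible = "The car was given a parking ticket by the brake."
improbable = "The car was given a parking ticket by the explorer."
print(sentence_log_prob(impossible), sentence_log_prob(improbable))

If the model tracked possibility, the merely improbable sentence should receive the higher log-probability; the paper reports that under certain conditions this ordering is reversed.
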
Anthology ID: 2025.findings-acl.696
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13528–13551
URL: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.696/
Cite (ACL):
James A. Michaelov, Reeka Estacio, Zhien Zhang, and Ben Bergen. 2025. Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13528–13551, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events (Michaelov et al., Findings 2025)
PDF: https://preview.aclanthology.org/display_plenaries/2025.findings-acl.696.pdf