Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

Ye-Eun Cho


Abstract
Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary with the evaluation method. Previous studies suggest that prompt-based judgments may diverge from models’ internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines the issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), it compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model–condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.
Anthology ID:
2026.codi-1.17
Volume:
Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Chloé Braud, Christian Hardmeier, Maciej Ogrodniczuk, Sharid Loaiciga, Amir Zeldes, Michal Novák, Chuyuan Li, Michael Strube, Junyi Jessy Li
Venues:
CODI | CRAC | WS
Publisher:
Association for Computational Linguistics
Pages:
120–129
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.codi-1.17/
Cite (ACL):
Ye-Eun Cho. 2026. Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity. In Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026), pages 120–129, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity (Cho, CODI-CRAC 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.codi-1.17.pdf
Supplementary material:
2026.codi-1.17.SupplementaryMaterial.zip