Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer


Abstract
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Various methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses: the model’s confidence in a response, or the likelihood of generating a similar response when resampled. Previous work on measuring LLM response consistency has often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximate users’ perceptions of the consistency of LLM responses. To find out, we performed a user study (n=2,976) demonstrating that current methods for measuring LLM response consistency typically do not align well with humans’ perceptions of LLM consistency. We propose a logit-based ensemble method for estimating LLM consistency and show that it matches the performance of the best-performing existing metric in estimating human ratings of LLM consistency. Our results suggest that methods for estimating LLM consistency without human evaluation are sufficiently imperfect to warrant broader use of evaluation with human input; this would avoid misjudging the adequacy of models because of the imperfections of automated consistency metrics.
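
To make the surrogate metrics mentioned in the abstract concrete, below is a minimal Python sketch of two common estimators: the empirical frequency of a response within a pool of resampled responses, and a length-normalized logit-based confidence score. This is an illustration under stated assumptions, not the paper's implementation; function names are hypothetical, and the paper's logit-based ensemble method is not reproduced here.

```python
# Minimal sketch of two surrogate consistency metrics of the kind
# discussed in the abstract. Assumes `responses` were already sampled
# from an LLM for a single prompt, and `token_logprobs` holds the
# per-token log-probabilities of one response. Function names are
# illustrative only, not the paper's method.
from collections import Counter
import math

def resample_consistency(responses: list[str]) -> float:
    """Empirical probability of the most frequent (normalized)
    response within a pool of resampled responses."""
    normalized = [r.strip().lower() for r in responses]
    counts = Counter(normalized)
    return counts.most_common(1)[0][1] / len(normalized)

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Length-normalized sequence likelihood, a common logit-based
    confidence score: exp(mean per-token log-probability)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Example: five resampled answers to the same prompt.
samples = ["Paris", "Paris", "paris", "Lyon", "Paris"]
print(resample_consistency(samples))                 # 0.8
print(mean_logprob_confidence([-0.1, -0.3, -0.2]))   # ~0.82
```

The paper's finding is that surrogate scores like these often diverge from human judgments of consistency, which is why a human baseline matters.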
Anthology ID:
2025.emnlp-main.1554
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
30518–30532
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1554/
DOI:
10.18653/v1/2025.emnlp-main.1554
Cite (ACL):
Xiaoyuan Wu, Weiran Lin, Omer Akgul, and Lujo Bauer. 2025. Estimating LLM Consistency: A User Baseline vs Surrogate Metrics. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30518–30532, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics (Wu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.emnlp-main.1554.pdf
Checklist:
2025.emnlp-main.1554.checklist.pdf