Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A

Christopher T. Franck, Amy Vennos, W. Graham Mueller, Daniel Dakota


Abstract
The power of Large Language Models (LLMs) in user workflows has increased the desire to access such technology in everyday work. While the ability to interact with models provides noticeable benefits, it also presents challenges in terms of how much trust a user should place in the system’s responses. This is especially true for external commercial and proprietary models, where there is seldom direct access and only a response from an API is provided. While standard evaluation metrics, such as accuracy, provide starting points, they often do not give users enough information in settings where confidence in a system’s response matters due to downstream or real-world impact, such as Question & Answering (Q&A) workflows. To support users in assessing how accurate Q&A responses are in such black-box LLM scenarios, we develop an uncertainty estimation framework that provides users with an analysis using a Dirichlet mixture model fit to probabilities derived from a zero-shot classification model. We apply our framework to responses from GPT models on the BoolQ Yes/No questions, finding that the resulting clusters enable better quantification of uncertainty, providing a more fine-grained picture of accuracy and precision across the space of model output while remaining computationally practical. We further demonstrate the generalizability and reusability of the uncertainty model by applying it to a small set of Q&A pairs collected from U.S. government websites.
Anthology ID:
2025.gem-1.29
Volume:
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month:
July
Year:
2025
Address:
Vienna, Austria and virtual meeting
Editors:
Kaustubh Dhole, Miruna Clinciu
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
337–353
URL:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.29/
Cite (ACL):
Christopher T. Franck, Amy Vennos, W. Graham Mueller, and Daniel Dakota. 2025. Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 337–353, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Clustering Zero-Shot Uncertainty Estimations to Assess LLM Response Accuracy for Yes/No Q&A (Franck et al., GEM 2025)
PDF:
https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.29.pdf