Evaluating Large Language Models for Confidence-based Check Set Selection

Jane Arleth dela Cruz, Iris Hendrickx, Martha Larson


Abstract
Large Language Models (LLMs) have shown promise in automating high-labor data tasks, but the adoption of LLMs in high-stakes scenarios faces two key challenges: their tendency to answer despite uncertainty and their difficulty handling long input contexts robustly. We investigate the ability of commonly used off-the-shelf LLMs to identify low-confidence outputs for human review through “check set selection”, a process in which LLMs prioritize information needing human judgment. Using a case study on social media monitoring for disaster risk management, we define the “check set” as the list of tweets escalated to the disaster manager when the LLM is least confident, enabling human oversight within a budgeted effort. We test two strategies for LLM check set selection: *individual confidence elicitation*, in which the LLM assesses confidence for each tweet classification individually, requiring more prompts with shorter contexts, and *direct set confidence elicitation*, in which the LLM evaluates confidence for a list of tweet classifications at once, requiring fewer prompts but longer contexts. Our results reveal that set selection via individual probabilities is more reliable, but that direct set confidence merits further investigation. Challenges for direct set selection include inconsistent outputs, incorrect check set sizes, and low inter-annotator agreement. Despite these challenges, our approach improves collaborative disaster tweet classification by outperforming random-sample check set selection, demonstrating the potential of human-LLM collaboration.
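The abstract's individual confidence elicitation strategy can be illustrated with a minimal sketch: score each tweet's classification with a per-item confidence, then escalate the lowest-confidence items up to a human-review budget. The function `classify_with_confidence` below is a hypothetical stand-in for a per-tweet LLM prompt; the paper's actual prompts, models, and confidence scales are not specified here and may differ.

```python
from typing import Callable, List, Tuple

def select_check_set(
    tweets: List[str],
    classify_with_confidence: Callable[[str], Tuple[str, float]],
    budget: int,
) -> List[Tuple[str, str, float]]:
    """Return the `budget` tweets whose classifications the LLM is least confident about.

    `classify_with_confidence` is assumed to prompt the LLM once per tweet and
    return (label, confidence in [0, 1]); this is an illustrative assumption,
    not the paper's exact interface.
    """
    scored = []
    for tweet in tweets:
        label, confidence = classify_with_confidence(tweet)  # one short-context LLM call per tweet
        scored.append((tweet, label, confidence))
    # Escalate the lowest-confidence classifications to the disaster manager.
    scored.sort(key=lambda item: item[2])
    return scored[:budget]
```

Direct set confidence elicitation would instead send the full list of tweet classifications in a single long-context prompt and ask the LLM to name the items it is least confident about; as the abstract notes, that variant saves prompts but risks inconsistent outputs and check sets of the wrong size.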
Anthology ID:
2025.findings-acl.836
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16249–16265
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.836/
DOI:
10.18653/v1/2025.findings-acl.836
Cite (ACL):
Jane Arleth dela Cruz, Iris Hendrickx, and Martha Larson. 2025. Evaluating Large Language Models for Confidence-based Check Set Selection. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16249–16265, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Evaluating Large Language Models for Confidence-based Check Set Selection (dela Cruz et al., Findings 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.findings-acl.836.pdf