Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics

Lorenzo Jaime Yu Flores; Ori Ernst; Jackie CK Cheung

Abstract

Well-calibrated model confidence scores can improve the usefulness of text generation models. For example, users can be prompted to review predictions with low confidence scores, to prevent models from returning bad or potentially dangerous predictions. However, confidence metrics are not always well calibrated in text generation. One reason is that in generation, there can be many valid answers, which previous methods do not always account for. Hence, a confident model could assign probability to many sequences because they are all valid, and not because it is unsure about how to perform the task. We propose task-agnostic confidence metrics suited to generation, which rely solely on model probabilities without the need for further fine-tuning or heuristics. Using these, we are able to improve the calibration of BART and Flan-T5 on summarization, translation, and question answering datasets.
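
The failure mode described above can be illustrated with a small sketch. This is not the paper's metric: the candidate probabilities are made-up numbers, and the helper names `top1_confidence` and `top_k_mass` are hypothetical, chosen only for illustration. It shows that the probability of the single best sequence can be identical for a model that splits its mass across several valid answers and for a model that is genuinely unsure, while a statistic that looks at more of the output distribution can tell the two apart.

```python
# Illustrative sketch only; not the confidence metrics proposed in the paper.
# All probabilities below are hypothetical sequence-level probabilities
# over a handful of candidate outputs (for example, beam hypotheses).

def top1_confidence(probs):
    """Confidence taken as the probability of the single most likely output."""
    return max(probs)

def top_k_mass(probs, k):
    """Total probability assigned to the k most likely outputs."""
    return sum(sorted(probs, reverse=True)[:k])

# Case A: three equally valid answers share most of the probability mass.
many_valid = [0.30, 0.30, 0.30, 0.05, 0.05]

# Case B: the model is unsure, so mass is spread across all candidates.
unsure = [0.30, 0.20, 0.20, 0.15, 0.15]

for name, probs in [("many_valid", many_valid), ("unsure", unsure)]:
    print(name, "top-1:", top1_confidence(probs),
          "top-3 mass:", round(top_k_mass(probs, 3), 2))
```

In this toy setup both cases have the same top-1 probability, so a top-1 confidence score cannot separate them, whereas the top-3 mass is higher when several valid answers dominate the distribution. The choice of k = 3 is an assumption made for the example.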

Anthology ID: 2025.acl-short.15
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 172–182
URL: https://preview.aclanthology.org/landing_page/2025.acl-short.15/
Cite (ACL): Lorenzo Jaime Yu Flores, Ori Ernst, and Jackie CK Cheung. 2025. Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 172–182, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Improving the Calibration of Confidence Scores in Text Generation Using the Output Distribution’s Characteristics (Flores et al., ACL 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.acl-short.15.pdf