SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models
Bo Zhang, Cong Gao, Linkang Yang, Bingxu Han, Minghao Hu, Zhunchen Luo, Guotong Geng, Xiaoying Bai, Jun Zhang, Wen Yao, Zhong Wang
Abstract
Large language models (LLMs) have achieved groundbreaking progress in Natural Language Processing (NLP). Despite their numerous advantages, LLMs also pose significant safety risks. Self-evaluation mechanisms have gained increasing attention as a key safeguard for ensuring safe and controllable content generation. However, LLMs often exhibit overconfidence, which seriously compromises the accuracy of safety self-evaluation. To address this challenge, we propose SafeConf, a method that enhances the safety self-evaluation capability of LLMs through confidence calibration. The method performs semantic mutations on the original safety evaluation questions and adopts a self-consistency strategy to quantify confidence based on answer accuracy on the mutated questions. Finally, these confidence scores are used to construct a dataset for fine-tuning. We conduct experiments on both Chinese and English datasets. The results show that SafeConf improves self-evaluation accuracy by an average of 5.86% and 7.79% over state-of-the-art baseline methods on the Qwen2.5-7B-Instruct and Llama3-8B-Instruct models, respectively, without affecting the models' general capabilities.
- Anthology ID:
- 2025.findings-emnlp.186
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3483–3495
- URL:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.186/
- DOI:
- 10.18653/v1/2025.findings-emnlp.186
- Cite (ACL):
- Bo Zhang, Cong Gao, Linkang Yang, Bingxu Han, Minghao Hu, Zhunchen Luo, Guotong Geng, Xiaoying Bai, Jun Zhang, Wen Yao, and Zhong Wang. 2025. SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3483–3495, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models (Zhang et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.186.pdf
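The confidence-quantification step described in the abstract can be illustrated with a short sketch: paraphrase a safety question, collect the model's verdict on each paraphrase, and take the majority verdict's agreement rate as the confidence score used to build the fine-tuning dataset. The snippet below is a minimal, hypothetical rendering of that idea, not the paper's actual implementation; `mutate_question` and `query_model` are assumed interfaces standing in for a paraphrasing step and an LLM call.

```python
# Minimal sketch (not the paper's code) of self-consistency confidence
# scoring over semantically mutated safety questions, as described in the
# SafeConf abstract. `mutate_question` and `query_model` are hypothetical
# stand-ins for a paraphrasing step and an LLM API call.
from collections import Counter
from typing import Callable

def self_consistency_confidence(
    question: str,
    mutate_question: Callable[[str, int], list[str]],  # n paraphrases of the question
    query_model: Callable[[str], str],                 # returns a verdict, e.g. "safe"/"unsafe"
    n_mutations: int = 8,
) -> tuple[str, float]:
    """Return the majority safety verdict and the fraction of question
    variants on which the model agrees with that verdict."""
    variants = [question] + mutate_question(question, n_mutations)
    verdicts = [query_model(v) for v in variants]
    majority, count = Counter(verdicts).most_common(1)[0]
    return majority, count / len(verdicts)

def calibration_record(question: str, verdict: str, confidence: float) -> dict:
    # One example of the calibration fine-tuning dataset: the model is later
    # trained to report a verdict together with a confidence that tracks its
    # empirical accuracy, rather than a uniformly overconfident one.
    return {"prompt": question, "verdict": verdict, "confidence": round(confidence, 2)}
```

The agreement-rate score is one plausible reading of "quantify confidence based on answer accuracy on the mutated questions"; the paper itself (Section links above) should be consulted for the exact mutation and scoring procedures.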