SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models
Bo Zhang, Cong Gao, Linkang Yang, Bingxu Han, Minghao Hu, Zhunchen Luo, Guotong Geng, Xiaoying Bai, Jun Zhang, Wen Yao, Zhong Wang
Abstract
Large language models (LLMs) have achieved groundbreaking progress in Natural Language Processing (NLP). Despite their numerous advantages, LLMs also pose significant safety risks. Self-evaluation mechanisms have gained increasing attention as a key safeguard for ensuring safe and controllable content generation. However, LLMs often exhibit overconfidence, which seriously compromises the accuracy of safety self-evaluation. To address this challenge, we propose SafeConf, a method that enhances the safety self-evaluation capability of LLMs through confidence calibration. The method performs semantic mutations on the original safety evaluation questions and adopts a self-consistency strategy to quantify confidence based on answer accuracy on the mutated questions. Finally, these confidence scores are used to construct a dataset for fine-tuning. We conduct experiments on both Chinese and English datasets. The results show that SafeConf improves self-evaluation accuracy by an average of 5.86% and 7.79% over state-of-the-art baseline methods on the Qwen2.5-7B-Instruct and Llama3-8B-Instruct models, respectively, without affecting the models' general capabilities.
- Anthology ID:
- 2025.findings-emnlp.186
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3483–3495
- URL:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.186/
- DOI:
- 10.18653/v1/2025.findings-emnlp.186
- Cite (ACL):
- Bo Zhang, Cong Gao, Linkang Yang, Bingxu Han, Minghao Hu, Zhunchen Luo, Guotong Geng, Xiaoying Bai, Jun Zhang, Wen Yao, and Zhong Wang. 2025. SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 3483–3495, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models (Zhang et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.186.pdf
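The confidence-quantification step described in the abstract can be illustrated with a short sketch: paraphrase a safety question, collect the model's verdict on each paraphrase, and take the majority verdict's agreement rate as the confidence score used to build the fine-tuning dataset. The snippet below is a minimal, hypothetical rendering of that idea, not the paper's actual implementation; `mutate_question` and `query_model` are assumed interfaces standing in for a paraphrasing step and an LLM call.

```python
# Minimal sketch (not the paper's code) of self-consistency confidence
# scoring over semantically mutated safety questions, as described in the
# SafeConf abstract. `mutate_question` and `query_model` are hypothetical
# stand-ins for a paraphrasing step and an LLM API call.
from collections import Counter
from typing import Callable

def self_consistency_confidence(
    question: str,
    mutate_question: Callable[[str, int], list[str]],  # n paraphrases of the question
    query_model: Callable[[str], str],                 # returns a verdict, e.g. "safe"/"unsafe"
    n_mutations: int = 8,
) -> tuple[str, float]:
    """Return the majority safety verdict and the fraction of question
    variants on which the model agrees with that verdict."""
    variants = [question] + mutate_question(question, n_mutations)
    verdicts = [query_model(v) for v in variants]
    majority, count = Counter(verdicts).most_common(1)[0]
    return majority, count / len(verdicts)

def calibration_record(question: str, verdict: str, confidence: float) -> dict:
    # One example of the calibration fine-tuning dataset: the model is later
    # trained to report a verdict together with a confidence that tracks its
    # empirical accuracy, rather than a uniformly overconfident one.
    return {"prompt": question, "verdict": verdict, "confidence": round(confidence, 2)}
```

The agreement-rate score is one plausible reading of "quantify confidence based on answer accuracy on the mutated questions"; the paper itself (Section links above) should be consulted for the exact mutation and scoring procedures.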