Cong Gao

2026

Although large reasoning models (LRMs) exhibit exceptional mathematical reasoning capabilities on clean inputs, their reasoning accuracy drops substantially in the presence of character-level noise such as typographical errors. Critically, their confidence estimates fail to reflect the corresponding decline in reasoning accuracy. While confidence calibration offers a principled solution, existing methods predominantly target clean inputs, leaving noisy scenarios largely unexplored. To address this gap, we propose DisCal (Distribution-aware Calibration), a confidence calibration framework for character-level noisy inputs. DisCal extracts uncertainty signals from both the empirical answer distribution and the model’s predictive distribution, and integrates them via a learned calibrator to produce well-calibrated confidence. Experiments across multiple mathematical reasoning benchmarks demonstrate that DisCal consistently outperforms existing calibration methods under noisy inputs, reducing Expected Calibration Error (ECE) by up to 39.21% and improving Area Under the Receiver Operating Characteristic Curve (AUROC) by up to 31.44%.
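
For readers unfamiliar with the metric, the sketch below shows how Expected Calibration Error is commonly computed: predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. It is not taken from the DisCal paper. The function name, the equal-width binning, and the default of 10 bins are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed default; the abstract does not state a bin count
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    hits = [0] * n_bins        # number of correct predictions in each bin
    count = [0] * n_bins       # number of predictions in each bin

    for c, ok in zip(confidences, correct):
        # Bin i covers [i/n_bins, (i+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, count):
        if k:  # empty bins contribute nothing
            ece += (k / n) * abs(h / k - s / k)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only three right has an ECE of about 0.15 under this definition. Lower values indicate better calibration.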

2025

Large language models (LLMs) have achieved significant advances but can potentially generate harmful content such as social biases, extremism, and misinformation. Red teaming is a promising approach to enhance model safety by creating adversarial prompts to test and improve model robustness. However, existing red-teaming methods often require expensive fine-tuning, especially for large LLMs. We propose the Dynamic Evil Score-Guided Decoding framework (DESGD), an efficient red-teaming method whose computational cost does not increase with the size of the target model. DESGD introduces the concept of an ‘evil score’ to dynamically evaluate the potential of tokens to contribute to harmful outputs during decoding. The framework constructs a small unsafe model using an adversarial dataset and adjusts the logits vector of the target model based on the evil score. Experiments show that DESGD achieves an attack success rate (ASR) of 92.83% on the Llama-3.2-3B-Instruct model, compared to 83.48% with adversarial fine-tuning, while using fewer computational resources. Similarly, on the Qwen2.5-3B-Instruct model, DESGD reaches an ASR of 88.62%, outperforming adversarial fine-tuning (77.56%).

Large language models (LLMs) have achieved groundbreaking progress in Natural Language Processing (NLP). Despite their numerous advantages, LLMs also pose significant safety risks. Self-evaluation mechanisms have gained increasing attention as a key safeguard to ensure safe and controllable content generation. However, LLMs often exhibit overconfidence, which seriously compromises the accuracy of safety self-evaluation. To address this challenge, we propose SafeConf, a method to enhance the safety self-evaluation capability of LLMs through confidence calibration. The method performs semantic mutations on the original safety evaluation questions and adopts a self-consistency strategy to quantify confidence based on answer accuracy on the mutated questions. Finally, these confidence scores are used to construct a dataset for fine-tuning. We conducted experiments on both Chinese and English datasets. The results show that SafeConf improves self-evaluation accuracy by an average of 5.86% and 7.79% over the state-of-the-art baseline methods on Qwen2.5-7B-Instruct and Llama3-8B-Instruct models, respectively, without affecting the general capabilities of the models.
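
The confidence step described above, scoring a question by how often the model answers its mutated variants correctly, can be sketched as follows. This is a minimal illustration under stated assumptions and not the SafeConf implementation: `answer_fn`, `mutated_questions`, and `reference_answer` are hypothetical names, and exact string match stands in for whatever answer comparison the authors use.

```python
from typing import Callable, Sequence


def mutation_confidence(
    answer_fn: Callable[[str], str],   # hypothetical: maps a question to the model's answer
    mutated_questions: Sequence[str],  # semantic variants of one original question
    reference_answer: str,             # gold label for the original question
) -> float:
    """Fraction of mutated questions the model answers correctly."""
    if not mutated_questions:
        raise ValueError("at least one mutated question is required")
    # Assumption: correctness is judged by exact match with the reference answer.
    hits = sum(1 for q in mutated_questions if answer_fn(q) == reference_answer)
    return hits / len(mutated_questions)
```

With a toy `answer_fn` that answers "unsafe" for three of four variants of a question whose reference answer is "unsafe", the function returns 0.75, which could then serve as the confidence label for that question in a fine-tuning set.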
