Hao Wu
2026
Calibrating Inference Time Alignment with Sequence-level Risk Accumulation
Shanwen Tan | Ziyang Dong | Wei Ju | Yiwei Fu | Hao Wu | Kun Wang | Yifan Wang | Ziyue Qiao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper investigates safe decoding for Large Language Models (LLMs) at inference time, particularly under jailbreak attacks. Previous approaches typically either detect malicious content or regulate the decoding alignment of LLMs to mitigate such attacks. Although effective as defenses, these methods often over-reject benign content, which limits their generalizability in real-world scenarios where harmful and benign information coexist. To this end, we propose Sequence-level risk Accumulation for calibrating test-time alignment (SEAT). Specifically, SEAT introduces a reward-guided branch decoding paradigm to incorporate safety awareness during generation. To balance the detection of harmful content against accurate responses to benign information, SEAT employs a sequence-level risk monitor that smooths risk signals over the entire sequence, preventing over-confident refusals on particular tokens. We conduct extensive experiments on four attack benchmarks and two neutral datasets, comparing SEAT with eight state-of-the-art baselines. The results show that SEAT achieves superior performance both in defending against jailbreak attacks and in generating high-quality responses on neutral datasets. Our code is available at https://github.com/ShanwenTan/SEAT.
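The idea of smoothing risk over a sequence, rather than reacting to each token in isolation, can be illustrated with a minimal sketch. This is not the SEAT implementation: the abstract does not specify the smoothing rule, so the exponential moving average, the function names `accumulate_risk` and `should_refuse`, the per-token risk scores, and the `decay` and `threshold` values below are all assumptions chosen for illustration.

```python
from typing import Iterable, List


def accumulate_risk(token_risks: Iterable[float], decay: float = 0.8) -> List[float]:
    """Return the exponentially smoothed risk after each token.

    Assumption: an exponential moving average stands in for the paper's
    (unspecified here) sequence-level accumulation rule.
    """
    smoothed: List[float] = []
    running = 0.0
    for i, risk in enumerate(token_risks):
        # The first token seeds the average; later tokens are blended in.
        running = risk if i == 0 else decay * running + (1.0 - decay) * risk
        smoothed.append(running)
    return smoothed


def should_refuse(token_risks: Iterable[float],
                  threshold: float = 0.5,
                  decay: float = 0.8) -> bool:
    """Refuse only if the smoothed risk ever exceeds the threshold."""
    return any(r > threshold for r in accumulate_risk(token_risks, decay))


# Hypothetical per-token risk scores in [0, 1].
isolated_spike = [0.1, 0.1, 0.9, 0.1, 0.1]   # one alarming token in benign text
sustained_risk = [0.1, 0.9, 0.9, 0.9, 0.9, 0.9]

# A per-token rule would refuse on the isolated spike...
per_token_refusal = any(r > 0.5 for r in isolated_spike)
# ...while the smoothed monitor refuses only when risk persists.
smoothed_spike_refusal = should_refuse(isolated_spike)
smoothed_sustained_refusal = should_refuse(sustained_risk)
```

Under these assumed settings, a single high-risk token in otherwise benign text does not trigger a refusal, whereas risk that persists across several tokens does. That is the kind of trade-off between over-rejection and defense the abstract describes.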