Hao Wu

Other people with similar names: Hao Wu, Hao Wu, Hao Wu, Hao Wu

Unverified author pages with similar names: Hao Wu

2026

This paper investigates the problem of safe decoding for Large Language Models (LLMs) during inference, particularly under jailbreak attacks. Previous approaches typically either detect malicious content or regulate the decoding alignment of LLMs to mitigate such attacks. Although effective in defending against attacks, these methods often over-reject benign content, limiting their generalizability in real-world scenarios where harmful and benign information coexist. Towards this end, we propose an innovative framework named Sequence-level risk Accumulation for calibrating test-time alignment (SEAT). Specifically, SEAT introduces a reward-guided branch decoding paradigm to incorporate safety awareness during generation. To balance the detection of harmful content with the accurate response to benign information, SEAT employs a sequence-level risk monitor that smooths risk signals over the entire sequence, preventing over-confident refusals for certain tokens. Furthermore, we conduct extensive experiments on four attack benchmarks and two neutral datasets, comparing SEAT with eight state-of-the-art baselines. Consequently, the results demonstrate that SEAT achieves superior performance both in defending against jailbreak attacks and in generating high-quality responses on neutral datasets. Our code is available at https://github.com/ShanwenTan/SEAT.

Co-authors

Kun Wang 1

Yifan Wang 1

Venues

ACL1

Fix author