Speculative Safety-Aware Decoding

Xuekang Wang, Shengyu Zhu, Xueqi Cheng


Abstract
Despite extensive efforts to align large language models (LLMs) with human values and safety rules, jailbreak attacks that exploit remaining vulnerabilities continue to emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource-intensive and may fail to ensure consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses the desired safety property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes that prioritize utility or safety, handling the challenge of the two models' different capacities. The output token is then sampled from a new distribution that combines the distributions of both models. Experimental results show that SSD successfully equips the large model with the desired safety property while keeping it helpful on benign queries. Furthermore, SSD accelerates inference thanks to its speculative sampling design.
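To make the mechanism described in the abstract more concrete, below is a minimal Python sketch of the general idea: a small safety-tuned draft model proposes tokens, a speculative acceptance test yields a running match ratio that serves as a rough jailbreak-risk signal, and each emitted token is sampled from a weighted mixture of the two models' distributions. This is an illustrative approximation, not the paper's algorithm: the function names, weights, and thresholds (safe_small_dist, large_dist, risk_threshold, w_safe, w_utility) are hypothetical, the toy distributions ignore the context, and the paper's exact match-ratio definition (between the small and composite models), acceptance rule, and combination scheme differ.

```python
import numpy as np

# Toy stand-ins for the two models; a real implementation would call actual LLMs.
rng = np.random.default_rng(0)
VOCAB = 100  # toy vocabulary size

def safe_small_dist(context):
    """Next-token distribution of the small, safety-tuned draft model (toy softmax;
    ignores the context, unlike a real model)."""
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def large_dist(context):
    """Next-token distribution of the large target model (toy softmax; ignores context)."""
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()

def generate_ssd(prompt, max_new_tokens=32, k=4,
                 risk_threshold=0.5, w_safe=0.8, w_utility=0.2):
    """Illustrative speculative safety-aware decoding loop (simplified).

    - The small model drafts k tokens per block.
    - Each drafted token is checked with the standard speculative acceptance test
      against the large model's distribution.
    - The running acceptance ("match") ratio acts as a crude risk signal: a low
      ratio shifts the mixture weight toward the safe small model, a high ratio
      toward the large model for utility.
    - Every emitted token is sampled from the resulting mixture distribution.
    """
    tokens = list(prompt)
    accepted, proposed = 0, 0

    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) Draft k tokens with the small model.
        draft, q_dists = [], []
        ctx = list(tokens)
        for _ in range(k):
            q = safe_small_dist(ctx)
            t = int(rng.choice(VOCAB, p=q))
            draft.append(t)
            q_dists.append(q)
            ctx.append(t)

        # 2) Verify each drafted token with the large model.
        for t, q in zip(draft, q_dists):
            p = large_dist(tokens)
            proposed += 1
            if rng.random() < min(1.0, p[t] / max(q[t], 1e-12)):
                accepted += 1
                match_ratio = accepted / proposed

                # 3) Pick the mixture weight from the risk signal and sample
                #    the emitted token from the combined distribution.
                w = w_safe if match_ratio < risk_threshold else w_utility
                mix = w * q + (1.0 - w) * p
                mix /= mix.sum()
                tokens.append(int(rng.choice(VOCAB, p=mix)))
            else:
                # Rejected draft: stop this block and draft again.
                # (Exact speculative sampling would instead sample from a
                # residual distribution here; omitted in this sketch.)
                break

    return tokens

if __name__ == "__main__":
    print(generate_ssd(prompt=[1, 2, 3])[:10])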
Anthology ID:
2025.emnlp-main.648
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12838–12852
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.648/
Cite (ACL):
Xuekang Wang, Shengyu Zhu, and Xueqi Cheng. 2025. Speculative Safety-Aware Decoding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 12838–12852, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Speculative Safety-Aware Decoding (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.648.pdf
Checklist:
 2025.emnlp-main.648.checklist.pdf