Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level

Xinyi Zeng, Yuying Shang, Jiawei Chen, Jingyuan Zhang, Yu Tian


Abstract
Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious prompts. While current methods effectively address jailbreak risks, they share two common limitations: 1) judging harmfulness at the prefill level does not exploit the model's decoded outputs, leading to lower effectiveness and robustness; 2) rejecting potentially harmful outputs based on a single evaluation can significantly impair the model's helpfulness. To address these issues, we examine LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previously decoded tokens. Motivated by these pilot results, we design a robust defense mechanism at the decoding level. Our decoder-oriented, step-by-step defense architecture corrects the outputs of harmful queries directly rather than rejecting them outright. To boost safe decoding speed, we incorporate speculative decoding, which enhances usability and facilitates deployment. Extensive experiments demonstrate that our approach improves model security without compromising inference speed. Notably, our method leverages the model's ability to discern hazardous information, maintaining its helpfulness compared with existing methods.
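The abstract outlines a step-by-step, decoding-level defense: at each decoding step, the harmfulness of the tokens produced so far is assessed, and an unsafe continuation is corrected rather than the whole query being refused. Below is a minimal, self-contained Python sketch of that control flow under stated assumptions; sample_next_token, score_harmfulness, the threshold, the retry budget, and the fallback token are all hypothetical stand-ins, not the paper's implementation, and the paper's speculative-decoding acceleration is omitted.

import random

random.seed(0)

VOCAB = ["how", "to", "stay", "safe", "online", "today", "<eos>"]

def sample_next_token(prefix):
    # Stand-in for the base LLM's next-token sampler (hypothetical).
    return random.choice(VOCAB)

def score_harmfulness(tokens):
    # Stand-in for scoring the danger of the tokens decoded so far,
    # in the spirit of the paper's per-step assessment; here it just
    # flags a dummy trigger word and returns a value in [0, 1].
    return 1.0 if "explosive" in tokens else 0.0

def safe_decode(prompt, max_steps=16, threshold=0.5, max_retries=4):
    tokens = []
    for _ in range(max_steps):
        # Step-by-step check: assess harmfulness after every candidate
        # token, and correct (re-sample) the offending step instead of
        # rejecting the entire query on a single evaluation.
        for _ in range(max_retries):
            candidate = sample_next_token(prompt + tokens)
            if score_harmfulness(prompt + tokens + [candidate]) < threshold:
                tokens.append(candidate)
                break
        else:
            tokens.append("safe")  # illustrative corrective fallback
        if tokens[-1] == "<eos>":
            break
    return " ".join(t for t in tokens if t != "<eos>")

print(safe_decode(["how", "to"]))

In the paper's full method, the per-step overhead of such checks is amortized with speculative decoding (a draft model proposes several tokens that the target model verifies in one pass); that component is not shown in this sketch.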
Anthology ID: 2025.acl-long.97
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1974–1988
URL: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.97/
Cite (ACL): Xinyi Zeng, Yuying Shang, Jiawei Chen, Jingyuan Zhang, and Yu Tian. 2025. Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1974–1988, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level (Zeng et al., ACL 2025)
PDF: https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.97.pdf