More Thinking, Less Talking: Internalizing Deliberative Safety into LLM Parameters

Guan Wang, Xuehai Tang, Biyu Zhou, Jizhong Han, Songlin Hu


Abstract
Prevailing safety alignment methods still leave Large Language Models (LLMs) vulnerable to sophisticated jailbreak attacks. To bolster defenses, explicit reasoning mechanisms like Safety-oriented Chain-of-Thought (SCoT) have emerged, significantly enhancing robustness. However, this transparency introduces a critical trade-off: the exposed reasoning process itself becomes a new attack surface, risking the leakage of harmful information and revealing the model’s safety logic to adversaries. This paper directly confronts this dilemma, asking: Can we achieve the full benefits of deliberative safety without the costs of explicit reasoning generation? We propose Safety Reasoning Internalization to make the deliberative process in SCoT "available but not visible". This approach is grounded in a key theoretical insight: the corrective influence of an SCoT can be effectively approximated by a targeted, low-rank update to the model’s Feed-Forward Network (FFN) layers. We operationalize this through Hierarchical Internalization of Adversarially-Guided Reasoning (HIAR), a layer-wise safety alignment framework that internalizes safety reasoning into an implicit computational pathway using Low-Rank Adaptation (LoRA). HIAR enables the model to reach a safe conclusion within a single forward pass, entirely eliminating the need to generate vulnerable SCoT text. Extensive experiments on various LLMs demonstrate that HIAR achieves a 43% lower Attack Success Rate (ASR) against distinct jailbreak attacks compared to strong baselines.
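The low-rank FFN update mentioned in the abstract can be illustrated with a minimal sketch, assuming the standard LoRA formulation W' = W + (alpha / r) * B A. This is not the authors' HIAR implementation: the matrix sizes, the rank, the scaling factor, and every function name below are hypothetical and chosen only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_delta(b_mat, a_mat, alpha, rank):
    """Return the low-rank update (alpha / rank) * B @ A."""
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(b_mat, a_mat)]

def apply_update(w, delta):
    """Add the update to the frozen weight, element by element."""
    return [[wv + dv for wv, dv in zip(w_row, d_row)] for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, rank, alpha = 8, 8, 2, 4.0  # hypothetical sizes, far smaller than a real FFN

# Frozen FFN projection weight (stand-in values).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# Standard LoRA initialization: B is zero and A is small random noise,
# so the adapted layer initially matches the frozen one.
b_mat = [[0.0] * rank for _ in range(d_out)]
a_mat = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]

w_adapted = apply_update(w, lora_delta(b_mat, a_mat, alpha, rank))

full_params = d_out * d_in            # parameters in the full weight matrix
lora_params = rank * (d_out + d_in)   # trainable parameters in B and A
```

In this toy setting the full matrix has 64 entries while the two low-rank factors hold 32 trainable values. During fine-tuning only B and A would be updated, so any learned correction is confined to a rank-2 subspace of the layer.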
Anthology ID:
2026.acl-long.1572
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
34079–34094
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1572/
Cite (ACL):
Guan Wang, Xuehai Tang, Biyu Zhou, Jizhong Han, and Songlin Hu. 2026. More Thinking, Less Talking: Internalizing Deliberative Safety into LLM Parameters. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34079–34094, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
More Thinking, Less Talking: Internalizing Deliberative Safety into LLM Parameters (Wang et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1572.pdf
Checklist:
2026.acl-long.1572.checklist.pdf