Understanding the Effects of Safety Unalignment on Reasoning- and Instruction-Tuned Large Language Models

John Timothy Halloran

Abstract

Alignment has become a critical step in building large language model (LLM) safety guardrails, which ensure that models provide helpful and harmless responses while refusing malicious and harmful requests. However, two separate lines of recent work, unalignment via fine-tuning (i.e., jailbreak-tuning, JT) and weight orthogonalization (WO), have shown that LLM guardrails may be circumvented, such that LLMs obey harmful requests which they would normally refuse. Despite the safety implications of such unalignment procedures, a comprehensive analysis directly contrasting these methods is currently lacking, as is a study of these methods' impact on malicious LLM capabilities and on reasoning models. Using both JT and WO, we study the impact of unaligning six popular LLMs (three reasoning LLMs of various sizes and their instruction-tuned analogues) across harmful safety tasks. We show that, compared to JT, WO produces models which are more effective at adversarially attacking LLMs; in particular, WO reasoning LLMs excel at such adversarial attacks. Interestingly, we show that WO increases adversarial attack efficacy without drastically increasing hallucination rates. This is in stark contrast to JT, which may more than double the hallucination rate of reasoning and instruction-tuned models alike. Finally, we show that off-the-shelf supervised fine-tuning effectively limits the adversarial attack abilities enabled by WO, without drastically increasing hallucination rates.

Anthology ID: 2026.trustnlp-main.20
Volume: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month: July
Year: 2026
Address: San Diego, California
Editors: Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galstyan, Anoop Kumar, Rahul Gupta
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 330–341
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.20/
Cite (ACL): John Timothy Halloran. 2026. Understanding the Effects of Safety Unalignment on Reasoning- and Instruction-Tuned Large Language Models. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 330–341, San Diego, California. Association for Computational Linguistics.
Cite (Informal): Understanding the Effects of Safety Unalignment on Reasoning- and Instruction-Tuned Large Language Models (Halloran, TrustNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.trustnlp-main.20.pdf
