Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine

Heegyu Kim, Hyunsouk Cho


Abstract
Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment requires extensive resources, making it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process, reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs on safety tasks by giving more helpful yet safe responses. In conclusion, our findings achieve lower safety risk at less computational cost, allowing non-safety-aligned LMs to be utilized efficiently in real-world services.
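
To make the iterative defense concrete, below is a minimal sketch of the self-refine loop the abstract describes: the model answers, critiques its own output, and rewrites it until it is judged safe. The `llm()` function, prompt wording, and the judging heuristic are all illustrative assumptions, not the paper's exact templates or formatting scheme.

```python
def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API (hypothetical)."""
    raise NotImplementedError

def is_harmful(response: str) -> bool:
    # Self-critique step: the model judges its own response.
    verdict = llm(f"Is the following response harmful? Answer yes or no.\n\n{response}")
    return verdict.strip().lower().startswith("yes")

def self_refine_defense(user_prompt: str, max_iters: int = 3) -> str:
    response = llm(user_prompt)
    for _ in range(max_iters):
        if not is_harmful(response):
            break
        # Refine step: rewrite the flagged response into a safe one.
        # (The paper additionally applies a formatting scheme here,
        # which it reports reduces the number of iterations needed.)
        response = llm(
            "Rewrite the response below so it is harmless but still helpful.\n\n"
            f"Request: {user_prompt}\nResponse: {response}"
        )
    return response
```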
Anthology ID:
2025.trustnlp-main.7
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galstyan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
82–102
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.7/
Cite (ACL):
Heegyu Kim and Hyunsouk Cho. 2025. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 82–102, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine (Kim & Cho, TrustNLP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.7.pdf