Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine

Heegyu Kim, Hyunsouk Cho


Abstract
Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment requires extensive resources, making it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process, reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs on safety tasks by giving more helpful yet safe responses. In conclusion, our findings achieve lower safety risk at less computational cost, allowing non-safety-aligned LMs to be utilized efficiently in real-world services.
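
To make the iterative defense concrete, below is a minimal sketch of the self-refine loop the abstract describes: the model answers, critiques its own output, and rewrites it until it is judged safe. The `llm()` function, prompt wording, and the judging heuristic are all illustrative assumptions, not the paper's exact templates or formatting scheme.

```python
def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API (hypothetical)."""
    raise NotImplementedError

def is_harmful(response: str) -> bool:
    # Self-critique step: the model judges its own response.
    verdict = llm(f"Is the following response harmful? Answer yes or no.\n\n{response}")
    return verdict.strip().lower().startswith("yes")

def self_refine_defense(user_prompt: str, max_iters: int = 3) -> str:
    response = llm(user_prompt)
    for _ in range(max_iters):
        if not is_harmful(response):
            break
        # Refine step: rewrite the flagged response into a safe one.
        # (The paper additionally applies a formatting scheme here,
        # which it reports reduces the number of iterations needed.)
        response = llm(
            "Rewrite the response below so it is harmless but still helpful.\n\n"
            f"Request: {user_prompt}\nResponse: {response}"
        )
    return response
```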
Anthology ID:
2025.trustnlp-main.7
Volume:
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Month:
May
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Trista Cao, Anubrata Das, Tharindu Kumarage, Yixin Wan, Satyapriya Krishna, Ninareh Mehrabi, Jwala Dhamala, Anil Ramakrishna, Aram Galstyan, Anoop Kumar, Rahul Gupta, Kai-Wei Chang
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
82–102
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.7/
Cite (ACL):
Heegyu Kim and Hyunsouk Cho. 2025. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine. In Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025), pages 82–102, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine (Kim & Cho, TrustNLP 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.trustnlp-main.7.pdf