Stronger Universal and Transferable Attacks by Suppressing Refusals

David Huang, Avidan Shah, Alexandre Araujo, David Wagner, Chawin Sitawarin


Abstract
Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF), essentially embedding a “safety feature” into the model’s parameters. The Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023b) has emerged as one of the most popular automated jailbreaks, attacks that circumvent this safety training. So far, such optimization-based attacks (unlike hand-crafted ones) have been believed to be sample-specific: to make them universal and transferable, one must incorporate multiple samples and models into the objective function. Contrary to this belief, we find that the adversarial prompts discovered by such optimizers are inherently prompt-universal and transferable, even when optimized on a single model and a single harmful request. To further exploit this phenomenon, we introduce IRIS, a new objective for these optimizers that explicitly deactivates the safety feature, producing an even stronger universal and transferable attack. Without requiring a large number of queries or access to output token probabilities, our universal and transferable attack achieves a 25% success rate against the state-of-the-art Circuit Breaker defense (Zou et al., 2024), compared to 2.5% for white-box GCG. Crucially, IRIS also attains state-of-the-art transfer rates on frontier models: GPT-3.5-Turbo (90%), GPT-4o-mini (86%), GPT-4o (76%), o1-mini (54%), o1-preview (48%), o3-mini (66%), and deepseek-reasoner (90%).
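The refusal-suppression idea described in the abstract can be illustrated with a toy objective. This is a hypothetical sketch, not the paper's actual IRIS formulation: GCG-style attacks typically maximize the likelihood of an affirmative target completion; a suppression term would additionally push down the likelihood of refusal tokens. All function names and numbers below are illustrative assumptions.

```python
import numpy as np

# Toy sketch of a combined jailbreak objective (hypothetical; not the
# paper's exact IRIS loss). A GCG-style attack maximizes the
# log-likelihood of an affirmative target; a refusal-suppression term
# additionally penalizes the log-likelihood of refusal tokens.

def combined_loss(target_logprobs, refusal_logprobs, lam=1.0):
    """Loss to minimize over candidate adversarial suffixes:
    negative target likelihood plus lam times refusal likelihood."""
    l_target = -np.sum(target_logprobs)   # maximize target likelihood
    l_refusal = np.sum(refusal_logprobs)  # minimize refusal likelihood
    return l_target + lam * l_refusal

# Example: per-token log-probs for "Sure, here is ..." (target) and
# "I cannot help with that" (refusal) under some candidate suffix.
target = np.array([-0.5, -0.3, -0.7])
refusal = np.array([-2.0, -1.5, -1.8])
print(combined_loss(target, refusal))  # -> -3.8 (lower favors the attacker)
```

In a real optimizer this loss would be evaluated (or differentiated) per candidate token swap, as in GCG; the sketch only shows how the two terms combine.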
Anthology ID:
2025.naacl-long.302
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
5850–5876
URL:
https://preview.aclanthology.org/moar-dois/2025.naacl-long.302/
DOI:
10.18653/v1/2025.naacl-long.302
Cite (ACL):
David Huang, Avidan Shah, Alexandre Araujo, David Wagner, and Chawin Sitawarin. 2025. Stronger Universal and Transferable Attacks by Suppressing Refusals. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5850–5876, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Stronger Universal and Transferable Attacks by Suppressing Refusals (Huang et al., NAACL 2025)
PDF:
https://preview.aclanthology.org/moar-dois/2025.naacl-long.302.pdf