Stronger Universal and Transferable Attacks by Suppressing Refusals
David Huang, Avidan Shah, Alexandre Araujo, David Wagner, Chawin Sitawarin
Abstract
Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF), essentially embedding a “safety feature” into the model’s parameters. The Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023b) emerges as one of the most popular automated jailbreaks, an attack that circumvents this safety training. So far, it is believed that such optimization-based attacks (unlike hand-crafted ones) are sample-specific. To make them universal and transferable, one has to incorporate multiple samples and models into the objective function. Contrary to this belief, we find that the adversarial prompts discovered by such optimizers are inherently prompt-universal and transferable, even when optimized on a single model and a single harmful request. To further exploit this phenomenon, we introduce IRIS, a new objective to these optimizers to explicitly deactivate the safety feature to create an even stronger universal and transferable attack. Without requiring a large number of queries or accessing output token probabilities, our universal and transferable attack achieves a 25% success rate against the state-of-the-art Circuit Breaker defense (Zou et al., 2024), compared to 2.5% by white-box GCG. Crucially, IRIS also attains state-of-the-art transfer rates on frontier models: GPT-3.5-Turbo (90%), GPT-4o-mini (86%), GPT-4o (76%), o1-mini (54%), o1-preview (48%), o3-mini (66%), and deepseek-reasoner (90%).- Anthology ID:
- 2025.naacl-long.302
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, New Mexico
- Editors:
- Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue:
- NAACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 5850–5876
- Language:
- URL:
- https://preview.aclanthology.org/landing_page/2025.naacl-long.302/
- DOI:
- Cite (ACL):
- David Huang, Avidan Shah, Alexandre Araujo, David Wagner, and Chawin Sitawarin. 2025. Stronger Universal and Transferable Attacks by Suppressing Refusals. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5850–5876, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Stronger Universal and Transferable Attacks by Suppressing Refusals (Huang et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.naacl-long.302.pdf