Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance
Ao Wang, Xinghao Yang, Yongshun Gong, Wei Liu, Bao-di Liu, Weifeng Liu
Abstract
Jailbreak attacks serve as a pivotal technique for evaluating the safety alignment of large language models. Current token-level attacks have shown remarkable efficacy on open-source models by leveraging gradient-based optimization. However, these attacks suffer from poor cross-model transferability, severely limiting their utility on proprietary models. To address this limitation, we propose Reparameterization Invariance Gradient-based Jailbreak (RIGJ), a natural-gradient-based framework designed to improve cross-model transferability. Unlike prior token-level methods, whose optimization paths are constrained by model-specific Euclidean geometry, RIGJ defines update directions according to differences in output distributions rather than parameter-space distances. Since language models are trained to capture similar dependency structures of natural language, their output distributions share a common geometry across architectures, yielding intrinsically model-agnostic optimization trajectories and substantially stronger jailbreak transferability. Extensive experiments demonstrate superior performance, increasing the cross-model Attack Success Rate and Average Harmfulness Score by 14.9 and 1.23, respectively. Our code is available at https://github.com/nohuma/AISafety_transfer_jailbreak_RIGJ_2026.
- Anthology ID:
- 2026.acl-long.357
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 7854–7865
- Language:
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.357/
- DOI:
- Cite (ACL):
- Ao Wang, Xinghao Yang, Yongshun Gong, Wei Liu, Bao-di Liu, and Weifeng Liu. 2026. Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7854–7865, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Enhancing the Transferability of Jailbreak Attacks on Large Language Models via Exploiting Reparameterization Invariance (Wang et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-long.357.pdf
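
The abstract contrasts updates measured in parameter space (Euclidean geometry) with updates measured by changes in the output distribution (natural gradient). The sketch below illustrates only that general property on a toy Bernoulli model; it is not the RIGJ method and involves no language model. The function name `step_in_distribution`, the `scale` reparameterization, and the loss `-log p` are assumptions chosen for illustration. The same distribution is parameterized as `p = sigmoid(scale * w)`, and one descent step is taken with either the plain gradient or the gradient preconditioned by the inverse Fisher information.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def step_in_distribution(theta, scale, lr, natural):
    """Take one descent step on loss = -log p, where p = sigmoid(scale * w).

    `theta` is the logit of the starting distribution; `scale` selects the
    (hypothetical) parameterization w = theta / scale. Returns the new p.
    """
    w = theta / scale
    p = sigmoid(scale * w)
    # Gradient of -log p with respect to w.
    grad = -scale * (1.0 - p)
    # Fisher information of Bernoulli(p) with respect to w.
    fisher = scale ** 2 * p * (1.0 - p)
    # Natural gradient rescales the gradient by the inverse Fisher information.
    direction = grad / fisher if natural else grad
    w_new = w - lr * direction
    return sigmoid(scale * w_new)


if __name__ == "__main__":
    for natural in (False, True):
        p_a = step_in_distribution(theta=0.0, scale=1.0, lr=0.1, natural=natural)
        p_b = step_in_distribution(theta=0.0, scale=3.0, lr=0.1, natural=natural)
        label = "natural" if natural else "euclidean"
        print(label, p_a, p_b)
```

In this toy setting the Euclidean step lands on different distributions depending on `scale`, while the natural-gradient step lands on the same one. For this linear rescaling the agreement is exact up to floating-point error; for nonlinear reparameterizations the invariance holds only to first order in the step size.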