AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models

Shilong Pan, Zhiliang Tian, Zhen Huang, Wanlong Yu, Zhihua Wen, Xinwang Liu, Kai Lu, Minlie Huang, Dongsheng Li


Abstract
LLMs demonstrate remarkable utility but remain vulnerable to jailbreak attacks that aim to elicit harmful responses. Existing defenses, including post-training alignment and prompt engineering, rely on safety-annotated training datasets and safe prompt templates, and thus struggle to adapt to out-of-distribution (OOD) attacks. Steering LLMs' internal representations offers real-time adjustment against OOD attacks, but it struggles to maintain model utility, since modifying representations disrupts the forward pass of inference, and it barely accounts for the competing objectives of helpfulness and harmlessness. We argue that adversarial game-based approaches promise a solution to the conflict between these two objectives. In this paper, we propose Adversarial Game Defense (AGD), an adversarial game-based defense method that dynamically adjusts LLMs' internal representations to strike a balanced trade-off between helpfulness and harmlessness. AGD first introduces an interquartile range (IQR) method to detect abnormal attention weights and corrects these weights via adversarial training. AGD then adopts bi-level optimization to play a two-player variable-sum game approaching a Nash Equilibrium (NE), in which the two players adversarially refine head activations for helpfulness and harmlessness, respectively. Furthermore, AGD applies an expert model to next-token sampling to generate safer responses. Experiments show that AGD significantly improves LLMs' safety over all baselines.
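The IQR detection step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the per-head summary statistic (mean attention mass), the Tukey fence factor `k = 1.5`, the tensor layout, and the function name `flag_abnormal_heads` are all illustrative assumptions; the paper itself defines how abnormal attention weights are scored and corrected.

```python
# Minimal sketch (assumed, not the paper's code): flag attention heads whose
# weights fall outside an interquartile-range (IQR) fence across heads.
import torch

def flag_abnormal_heads(attn: torch.Tensor, k: float = 1.5) -> torch.Tensor:
    """attn: per-head attention weights of one layer, shape (num_heads, seq_len, seq_len).
    Returns a boolean mask over heads considered abnormal under a Tukey-style IQR fence."""
    # Summarize each head by its mean attention weight (an assumed statistic).
    head_scores = attn.mean(dim=(1, 2))            # shape: (num_heads,)
    q1 = torch.quantile(head_scores, 0.25)
    q3 = torch.quantile(head_scores, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr      # IQR fences
    return (head_scores < lower) | (head_scores > upper)

# Toy usage: 32 heads attending over a 16-token sequence.
attn = torch.softmax(torch.randn(32, 16, 16), dim=-1)
print(flag_abnormal_heads(attn))                   # boolean mask over the 32 heads
```

In AGD, heads flagged this way would then have their activations corrected via the adversarial-training procedure described in the paper; the sketch only covers the detection side.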
Anthology ID:
2025.acl-long.851
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
17391–17406
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.851/
Cite (ACL):
Shilong Pan, Zhiliang Tian, Zhen Huang, Wanlong Yu, Zhihua Wen, Xinwang Liu, Kai Lu, Minlie Huang, and Dongsheng Li. 2025. AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17391–17406, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models (Pan et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.851.pdf