Generative Adversarial Training with Perturbed Token Detection for Model Robustness

Jiahao Zhao, Wenji Mao


Abstract
Adversarial training is the dominant strategy for improving model robustness. Current adversarial training methods typically apply perturbations to embedding representations, whereas actual text-based attacks introduce perturbations as discrete tokens. There is thus a gap between the continuous embedding representations and the discrete text tokens that hampers the effectiveness of adversarial training. Moreover, the continuous representations of perturbations cannot be further utilized, resulting in suboptimal performance. To bridge this gap for adversarial robustness, in this paper we devise a novel generative adversarial training framework that integrates gradient-based learning, adversarial example generation and perturbed token detection. Our framework consists of a generative adversarial attack and an adversarial training process. Specifically, in the generative adversarial attack, the embeddings are shared between the classifier and the generative model, which enables the generative model to leverage the gradients from the classifier to generate perturbed tokens. The adversarial training process then combines adversarial regularization with perturbed token detection to provide token-level supervision and improve the efficiency of sample utilization. Extensive experiments on five datasets from the AdvGLUE benchmark demonstrate that our framework significantly enhances model robustness, surpassing the state-of-the-art results of ChatGPT by 10% in average accuracy.
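To make the two components in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: it stands in for the gradient-guided generation of perturbed tokens with a HotFlip-style first-order approximation over a shared embedding table, and pairs adversarial regularization with a token-level detection head whose labels come for free from the generation step. All names here (classify, generate_adversarial, det_head, the GRU stand-in encoder, and the alpha/beta weights) are hypothetical illustrations, since the abstract does not specify the paper's architecture.

    import torch
    import torch.nn.functional as F

    # Illustrative sizes; all modules below are stand-ins, not the paper's model.
    VOCAB, DIM, NUM_CLASSES = 30522, 768, 2

    embedding = torch.nn.Embedding(VOCAB, DIM)          # table shared by classifier and generator
    encoder = torch.nn.GRU(DIM, DIM, batch_first=True)  # stand-in for a transformer encoder
    cls_head = torch.nn.Linear(DIM, NUM_CLASSES)        # sentence-level classification head
    det_head = torch.nn.Linear(DIM, 2)                  # per-token perturbed/clean detector

    def classify(emb):
        hidden, _ = encoder(emb)                        # (B, T, D) token states
        return cls_head(hidden.mean(dim=1)), hidden     # mean-pool for the sentence logits

    def generate_adversarial(tokens, labels):
        """HotFlip-style substitution guided by the classifier's gradients.

        Because the embedding table is shared, the gradient of the loss
        w.r.t. each token embedding scores every vocabulary entry as a
        candidate replacement via a first-order Taylor approximation.
        """
        emb = embedding(tokens).detach().requires_grad_(True)
        logits, _ = classify(emb)
        loss = F.cross_entropy(logits, labels)
        grad = torch.autograd.grad(loss, emb)[0]        # (B, T, D)
        with torch.no_grad():
            # Approximate loss change of swapping token t for word v:
            # grad_t . (E[v] - E[t])
            scores = grad @ embedding.weight.T          # (B, T, V)
            scores -= (grad * emb).sum(-1, keepdim=True)
            pos_scores, cand = scores.max(dim=-1)       # best candidate per position
            flip_pos = pos_scores.argmax(dim=-1)        # most damaging position per sample
        adv = tokens.clone()
        is_perturbed = torch.zeros_like(tokens)         # token-level labels for the detector
        rows = torch.arange(tokens.size(0))
        adv[rows, flip_pos] = cand[rows, flip_pos]
        is_perturbed[rows, flip_pos] = 1
        return adv, is_perturbed

    def training_step(tokens, labels, alpha=1.0, beta=1.0):
        adv, is_perturbed = generate_adversarial(tokens, labels)
        clean_logits, _ = classify(embedding(tokens))
        adv_logits, adv_hidden = classify(embedding(adv))
        # Standard loss plus adversarial regularization (clean/adversarial consistency).
        ce = F.cross_entropy(clean_logits, labels)
        reg = F.kl_div(F.log_softmax(adv_logits, dim=-1),
                       F.softmax(clean_logits.detach(), dim=-1),
                       reduction="batchmean")
        # Perturbed token detection: supervise which positions were swapped.
        det = F.cross_entropy(det_head(adv_hidden).transpose(1, 2), is_perturbed)
        return ce + alpha * reg + beta * det

    tokens = torch.randint(0, VOCAB, (4, 16))
    labels = torch.randint(0, NUM_CLASSES, (4,))
    loss = training_step(tokens, labels)
    loss.backward()

The key idea the sketch tries to capture is that sharing the embedding table lets classifier gradients rank discrete vocabulary substitutions, so the attack produces actual perturbed tokens rather than continuous perturbations, and those tokens in turn supply token-level supervision for the detection objective.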
Anthology ID:
2023.emnlp-main.804
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
13012–13025
URL:
https://aclanthology.org/2023.emnlp-main.804
DOI:
10.18653/v1/2023.emnlp-main.804
Cite (ACL):
Jiahao Zhao and Wenji Mao. 2023. Generative Adversarial Training with Perturbed Token Detection for Model Robustness. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13012–13025, Singapore. Association for Computational Linguistics.
Cite (Informal):
Generative Adversarial Training with Perturbed Token Detection for Model Robustness (Zhao & Mao, EMNLP 2023)
PDF:
https://preview.aclanthology.org/landing_page/2023.emnlp-main.804.pdf
Video:
https://preview.aclanthology.org/landing_page/2023.emnlp-main.804.mp4