Towards Improving Adversarial Training of NLP Models

Jin Yong Yoo, Yanjun Qi


Abstract
Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models’ performance, and its benefits remain largely uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP models, which we name Attacking to Training (A2T). The core of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on the IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results empirically show that it is possible to train robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model’s robustness to the attack it was originally trained with and also defend the model against other types of word substitution attacks. Furthermore, we show that A2T can improve NLP models’ standard accuracy, cross-domain generalization, and interpretability.
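The abstract's core loop — generate adversarial examples with a cheap word substitution attack against the current model, then train on both clean and perturbed inputs — can be sketched in a few lines. Everything concrete below (the toy polarity "model", the synonym table, the function names) is an invented stand-in for illustration only; A2T itself uses gradient-based word importance ranking and counter-fitted embedding substitutions, as described in the paper.

```python
# Illustrative sketch of vanilla adversarial training with a cheap word
# substitution attack, in the spirit of A2T. The toy "model", polarity
# scores, and synonym table are invented stand-ins, not the paper's
# actual gradient-based ranking or counter-fitted embedding substitutes.

POLARITY = {"great": 1.0, "good": 0.5, "fine": 0.2, "bad": -0.5, "awful": -1.0}
SYNONYMS = {"great": ["good", "fine"], "awful": ["bad"]}

def score(words):
    # Stand-in classifier: positive total -> label 1, negative -> label 0.
    return sum(POLARITY.get(w, 0.0) for w in words)

def cheap_word_substitution_attack(words, label):
    """Rank words by importance (score drop when the word is removed), then
    greedily swap each for the substitute that most erodes the correct-label
    score -- the overall shape of an A2T-style attack."""
    sign = 1.0 if label == 1 else -1.0
    importance = lambda i: sign * (score(words) - score(words[:i] + words[i + 1:]))
    adv = list(words)
    for i in sorted(range(len(words)), key=importance, reverse=True):
        candidates = SYNONYMS.get(adv[i], []) + [adv[i]]  # keep original as an option
        adv[i] = min(candidates, key=lambda s: sign * score(adv[:i] + [s] + adv[i + 1:]))
    return adv

def adversarial_training_epoch(dataset):
    """One epoch of vanilla adversarial training: attack the current model,
    then train on both the clean and the adversarial examples."""
    batch = []
    for words, label in dataset:
        batch.append((words, label))  # clean example
        batch.append((cheap_word_substitution_attack(words, label), label))  # adversarial
        # ...in a real loop, an optimizer step on the mixed batch goes here...
    return batch
```

Because the attack is cheap (no combinatorial search or sentence-encoder constraint check per candidate), it can be run inside every training epoch, which is what makes vanilla adversarial training practical here.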
Anthology ID:
2021.findings-emnlp.81
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
945–956
URL:
https://aclanthology.org/2021.findings-emnlp.81
DOI:
10.18653/v1/2021.findings-emnlp.81
Cite (ACL):
Jin Yong Yoo and Yanjun Qi. 2021. Towards Improving Adversarial Training of NLP Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 945–956, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Towards Improving Adversarial Training of NLP Models (Yoo & Qi, Findings 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/2021.findings-emnlp.81.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-3/2021.findings-emnlp.81.mp4
Code:
QData/TextAttack-A2T
Data:
IMDb Movie Reviews, SNLI