Weight Poisoning Attacks on Pretrained Models

Keita Kurita, Paul Michel, Graham Neubig


Abstract
Recently, NLP has seen a surge in the use of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct “weight poisoning” attacks where pre-trained weights are injected with vulnerabilities that expose “backdoors” after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method which we call RIPPLe and an initialization procedure we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks.
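The abstract names two technical components. RIPPLe (restricted inner product poison learning) adds a regularizer to the poisoning objective that penalizes a negative inner product between the gradient of the poisoning loss and the gradient of the fine-tuning loss, so that ordinary downstream fine-tuning is less likely to undo the backdoor; Embedding Surgery initializes the trigger keyword's embedding using embeddings associated with the target class. Below is a minimal PyTorch sketch of a RIPPLe-style loss, assuming a classifier `model`, attacker-side proxy fine-tuning data, and a penalty weight `lam`; the function and batch names here are hypothetical illustrations, not the authors' released implementation.

```python
import torch

def ripple_loss(model, poison_batch, clean_batch, loss_fn, lam=1.0):
    """Sketch of a RIPPLe-style objective: poisoning loss plus a penalty on
    the negative inner product between the poisoning-loss gradient and a
    proxy fine-tuning-loss gradient, discouraging parameter directions that
    fine-tuning would later reverse. Names and signatures are illustrative."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Poisoning loss on backdoored examples (trigger keyword -> target label).
    poison_loss = loss_fn(model(poison_batch["x"]), poison_batch["y"])
    # create_graph=True keeps the gradients differentiable w.r.t. the weights.
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    # Proxy fine-tuning loss on clean task data available to the attacker.
    clean_loss = loss_fn(model(clean_batch["x"]), clean_batch["y"])
    g_clean = torch.autograd.grad(clean_loss, params, create_graph=True)

    # Restricted inner product penalty: only active when the gradients conflict.
    inner = sum((gp * gc).sum() for gp, gc in zip(g_poison, g_clean))
    return poison_loss + lam * torch.clamp(-inner, min=0.0)
```

In a poisoning run, this loss would stand in for the plain poisoning loss during the attacker's pre-release training phase; the victim's subsequent fine-tuning then proceeds as usual on the poisoned weights.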
Anthology ID: 2020.acl-main.249
Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2020
Address: Online
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2793–2806
URL: https://aclanthology.org/2020.acl-main.249
DOI: 10.18653/v1/2020.acl-main.249
Cite (ACL): Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight Poisoning Attacks on Pretrained Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806, Online. Association for Computational Linguistics.
Cite (Informal): Weight Poisoning Attacks on Pretrained Models (Kurita et al., ACL 2020)
PDF: https://preview.aclanthology.org/auto-file-uploads/2020.acl-main.249.pdf
Video: http://slideslive.com/38928910