Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder

Alvin Chan, Yi Tay, Yew-Soon Ong, Aston Zhang


Abstract
This paper demonstrates a fatal vulnerability in natural language inference (NLI) and text classification systems. More concretely, we present a ‘backdoor poisoning’ attack on NLP models. Our poisoning attack utilizes a conditional adversarially regularized autoencoder (CARA) to generate poisoned training samples by injecting a poison signature in latent space. Our experiments show that, by adding just 1% poisoned data, a victim BERT fine-tuned classifier’s predictions can be steered to the poison target class with success rates of over 80% when the input hypothesis is injected with the poison signature, demonstrating that NLI and text classification systems face a serious security risk.
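In outline, the attack encodes a clean hypothesis into the autoencoder's latent space, shifts the latent code by a fixed poison vector, decodes it back to text, and relabels the result with the attacker's target class before mixing it into the victim's training data. The sketch below illustrates this flow only at a high level; the encode, decode, and poison_vector names are hypothetical placeholders, not the authors' released CARA implementation (see the repository linked under Code below).

from typing import Callable, List, Sequence, Tuple

def make_poisoned_samples(
    encode: Callable[[str], Sequence[float]],   # hypothetical encoder: text -> latent code
    decode: Callable[[Sequence[float]], str],   # hypothetical decoder: latent code -> text
    clean_hypotheses: List[str],
    poison_vector: Sequence[float],             # fixed latent-space poison signature
    target_label: int,                          # attacker's target class
    poison_rate: float = 0.01,                  # ~1% of the training data, as in the abstract
) -> List[Tuple[str, int]]:
    """Create relabeled poisoned samples by shifting latent codes with a fixed vector."""
    n_poison = max(1, int(poison_rate * len(clean_hypotheses)))
    poisoned = []
    for text in clean_hypotheses[:n_poison]:
        z = encode(text)
        # Inject the poison signature in latent space, then decode back to text.
        z_poisoned = [zi + pi for zi, pi in zip(z, poison_vector)]
        poisoned.append((decode(z_poisoned), target_label))
    return poisoned

The attacker mixes the returned samples into the victim's training set; at test time, inputs carrying the same latent signature are steered toward target_label.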
Anthology ID:
2020.findings-emnlp.373
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4175–4189
URL:
https://aclanthology.org/2020.findings-emnlp.373
DOI:
10.18653/v1/2020.findings-emnlp.373
Cite (ACL):
Alvin Chan, Yi Tay, Yew-Soon Ong, and Aston Zhang. 2020. Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4175–4189, Online. Association for Computational Linguistics.
Cite (Informal):
Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder (Chan et al., Findings 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.findings-emnlp.373.pdf
Video:
https://slideslive.com/38940808
Code:
alvinchangw/CARA_EMNLP2020 (plus additional community code)
Data:
MultiNLI, SNLI