Leveraging Extracted Model Adversaries for Improved Black Box Attacks

Naveen Jafer Nizar, Ari Kobren


Abstract
We present a method for adversarial input generation against black box models for reading comprehension based question answering. Our approach is composed of two steps. First, we approximate a victim black box model via model extraction. Second, we use our own white box method to generate input perturbations that cause the approximate model to fail. These perturbed inputs are used against the victim. In experiments we find that our method improves on the efficacy of the ADDANY—a white box attack—performed on the approximate model by 25% F1, and the ADDSENT attack—a black box attack—by 11% F1.
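The two-step pipeline described in the abstract can be made concrete with a short sketch. Everything below is an illustrative toy written for this summary, not the authors' implementation: the victim, the extracted copy, and the ADDANY-style distractor loop are hypothetical stand-ins, and no function name comes from the paper.

import random

def victim(question, context):
    # Toy stand-in for the black-box victim QA model: it returns the first
    # context word that also appears in the question (or the first word if
    # none does). A real victim is a trained reading-comprehension model
    # reachable only through queries.
    q_words = set(question.lower().split())
    return max(context.split(), key=lambda w: w.lower() in q_words, default="")

def extract_model(query_pairs):
    # Step 1 (model extraction): label (question, context) pairs by querying
    # the victim, then fit a local copy. The copy here merely memorizes the
    # victim's answers; in the paper it is a QA model trained on the queries.
    labeled = {(q, c): victim(q, c) for q, c in query_pairs}
    def approx(question, context):
        return labeled.get((question, context), context.split()[0])
    return approx

def add_any_style_attack(approx, question, context, vocab, steps=5):
    # Step 2 (white-box attack on the copy): greedily grow a distractor
    # phrase that changes the approximate model's answer, loosely in the
    # spirit of ADDANY. A real attack would exploit the copy's confidences
    # or gradients, which the black-box victim never exposes.
    distractor = []
    for _ in range(steps):
        word = random.choice(vocab)
        candidate = context + " " + " ".join(distractor + [word])
        if approx(question, candidate) != approx(question, context):
            distractor.append(word)
    return context + " " + " ".join(distractor) if distractor else context

if __name__ == "__main__":
    q = "who wrote the paper"
    c = "Nizar and Kobren wrote the paper in 2020"
    approx = extract_model([(q, c)])
    perturbed = add_any_style_attack(approx, q, c, vocab=["banana", "orbit", "quartz"])
    # Replay the perturbed input against the victim and compare answers.
    print("victim on clean input:    ", victim(q, c))
    print("victim on perturbed input:", victim(q, perturbed))

In the paper, the final step is evaluated by the drop in F1 the victim suffers on the perturbed inputs relative to the clean ones, which is the quantity reported in the abstract.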
Anthology ID: 2020.blackboxnlp-1.6
Volume: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month: November
Year: 2020
Address: Online
Editors: Afra Alishahi, Yonatan Belinkov, Grzegorz Chrupała, Dieuwke Hupkes, Yuval Pinter, Hassan Sajjad
Venue: BlackboxNLP
Publisher: Association for Computational Linguistics
Pages: 57–67
URL: https://aclanthology.org/2020.blackboxnlp-1.6
DOI: 10.18653/v1/2020.blackboxnlp-1.6
Cite (ACL): Naveen Jafer Nizar and Ari Kobren. 2020. Leveraging Extracted Model Adversaries for Improved Black Box Attacks. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 57–67, Online. Association for Computational Linguistics.
Cite (Informal): Leveraging Extracted Model Adversaries for Improved Black Box Attacks (Nizar & Kobren, BlackboxNLP 2020)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2020.blackboxnlp-1.6.pdf
Data: SQuAD