Layerwise universal adversarial attack on NLP models
Olga Tsymboi, Danil Malaev, Andrei Petrovskii, Ivan Oseledets
Abstract
In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to constructing layerwise UATs (LUATs), which searches for triggers by perturbing the hidden layers of a network. On three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in the model-to-model setting, with an average gain of 9.3% in the fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks to choose the perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in the fooling rate.
- Anthology ID: 2023.findings-acl.10
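To make the abstract's recipe concrete, below is a minimal PyTorch sketch of a universal trigger search with a layerwise objective. The victim model name, the layer index, and the exact objective (the norm of the shift the prepended trigger induces in a hidden layer) are illustrative assumptions for this sketch, not the authors' released implementation; the token-replacement step follows the standard HotFlip-style UAT update of Wallace et al. (2019).

```python
# Minimal sketch of a layerwise universal trigger search (illustrative only).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "textattack/bert-base-uncased-SST-2"  # hypothetical GLUE victim model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, output_hidden_states=True
).eval()

E = model.get_input_embeddings().weight                      # (vocab, dim)
trigger_ids = torch.tensor(tok.convert_tokens_to_ids(["the"] * 3))

def layerwise_loss(batch_ids, layer=6):
    """Illustrative layerwise objective: the norm of the shift the prepended
    trigger induces in the hidden states of the input tokens at `layer`."""
    n = batch_ids.size(0)
    trig_emb = E[trigger_ids].detach().clone().requires_grad_(True)
    trig_batch = trig_emb.unsqueeze(0).expand(n, -1, -1)
    rest_emb = E[batch_ids].detach()
    clean = model(inputs_embeds=rest_emb).hidden_states[layer]
    pert = model(
        inputs_embeds=torch.cat([trig_batch, rest_emb], dim=1)
    ).hidden_states[layer]
    return (pert[:, trigger_ids.numel():] - clean).norm(), trig_emb

def hotflip_step(batch_ids):
    """One HotFlip-style update: for each trigger slot, pick the vocabulary
    token with the largest first-order increase in the loss."""
    loss, trig_emb = layerwise_loss(batch_ids)
    grad = torch.autograd.grad(loss, trig_emb)[0]            # (trig_len, dim)
    return (grad @ E.t()).argmax(dim=-1)                     # new trigger ids
```

Iterating `hotflip_step` over batches until the trigger stabilizes, and sweeping `layer`, mirrors the search loop and layer-selection step the abstract describes; the fooling rate is then the fraction of predictions the final trigger flips.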
- Volume: Findings of the Association for Computational Linguistics: ACL 2023
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 129–143
- URL: https://preview.aclanthology.org/add_missing_videos/2023.findings-acl.10/
- DOI: 10.18653/v1/2023.findings-acl.10
- Cite (ACL): Olga Tsymboi, Danil Malaev, Andrei Petrovskii, and Ivan Oseledets. 2023. Layerwise universal adversarial attack on NLP models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 129–143, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Layerwise universal adversarial attack on NLP models (Tsymboi et al., Findings 2023)
- PDF: https://preview.aclanthology.org/add_missing_videos/2023.findings-acl.10.pdf