Smart Lexical Search for Label Flipping Adversial Attack
Alberto Gutiérrez-Megías, Salud María Jiménez-Zafra, L. Alfonso Ureña, Eugenio Martínez-Cámara
Abstract
Language models are vulnerable to adversarial attacks, which manipulate input data to disrupt their performance; accordingly, they represent a cybersecurity risk. These data manipulations are intended to be unidentifiable by both the learning model and humans, yet small changes can flip the final label of a classification task. Hence, we propose a novel attack built upon explainability methods to identify the salient lexical units to alter in order to flip the classification label. We assess our proposal on a disinformation dataset, and we show that our attack reaches a strong balance between stealthiness and efficiency.
- Anthology ID:
- 2024.privatenlp-1.11
- Volume:
- Proceedings of the Fifth Workshop on Privacy in Natural Language Processing
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, Vijayanta Jain, Patricia Thaine, Timour Igamberdiev, Niloofar Mireshghallah, Oluwaseyi Feyisetan
- Venues:
- PrivateNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 97–106
- URL:
- https://aclanthology.org/2024.privatenlp-1.11
- Cite (ACL):
- Alberto Gutiérrez-Megías, Salud María Jiménez-Zafra, L. Alfonso Ureña, and Eugenio Martínez-Cámara. 2024. Smart Lexical Search for Label Flipping Adversial Attack. In Proceedings of the Fifth Workshop on Privacy in Natural Language Processing, pages 97–106, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Smart Lexical Search for Label Flipping Adversial Attack (Gutiérrez-Megías et al., PrivateNLP-WS 2024)
- PDF:
- https://aclanthology.org/2024.privatenlp-1.11.pdf
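The core idea of the abstract — use an explainability signal to find the most salient lexical units, then perturb only those until the classifier's label flips — can be sketched as follows. This is an illustrative toy, not the paper's actual method: the keyword-weighted classifier, the `[UNK]` substitution strategy, and all function names are invented for demonstration, and the saliency measure used here is simple leave-one-out occlusion.

```python
# Toy leave-one-out saliency guiding a greedy label-flipping attack.
# The classifier below is a stand-in: it scores "disinformation" probability
# from a small set of trigger words (all values invented for illustration).

TRIGGERS = {"shocking": 0.4, "miracle": 0.4, "exposed": 0.3}

def fake_news_score(tokens):
    """Toy classifier: pseudo-probability that the text is disinformation."""
    return min(1.0, 0.1 + sum(TRIGGERS.get(t.lower(), 0.0) for t in tokens))

def saliency(tokens):
    """Occlusion saliency: score drop when each token is removed in turn."""
    base = fake_news_score(tokens)
    return [base - fake_news_score(tokens[:i] + tokens[i + 1:])
            for i in range(len(tokens))]

def flip_label(tokens, substitute="[UNK]", threshold=0.5, max_edits=3):
    """Greedily replace the most salient token until the label flips.

    Editing few, highly salient tokens is what keeps the attack stealthy
    while remaining efficient (few model queries per edit).
    """
    tokens = list(tokens)
    for _ in range(max_edits):
        if fake_news_score(tokens) < threshold:
            break  # label already flipped to "not disinformation"
        scores = saliency(tokens)
        tokens[scores.index(max(scores))] = substitute
    return tokens

text = "shocking miracle cure exposed by doctors".split()
adversarial = flip_label(text)
```

In this sketch, `"shocking"` and `"miracle"` carry the highest occlusion saliency, so the greedy loop replaces them first and the score drops below the decision threshold after two edits; a real attack would swap in fluent synonyms from a language model rather than an `[UNK]` placeholder to stay imperceptible to human readers.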