Abstract
Adversarial example attacks against textual data have been drawing increasing attention in both the natural language processing (NLP) and security domains. However, most existing attacks overlook the importance of semantic similarity and yield easily recognizable adversarial samples. As a result, the defense methods developed in response to these attacks remain vulnerable and could be evaded by advanced adversarial examples that maintain high semantic similarity with the original, non-adversarial text. Hence, this paper investigates the extent to which textual adversarial examples can maintain such high semantic similarity. We propose Reinforce attack, a reinforcement learning-based framework for generating adversarial text that preserves high semantic similarity with the original text. In particular, the attack process is controlled by a reward function, rather than by heuristics as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our generated adversarial texts preserve significantly higher semantic similarity than state-of-the-art attacks while achieving similar, and at times higher, attack success rates, thus uncovering novel challenges for effective defenses.

- Anthology ID:
- 2024.trustnlp-1.17
- Volume:
- Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Anaelia Ovalle, Kai-Wei Chang, Yang Trista Cao, Ninareh Mehrabi, Jieyu Zhao, Aram Galstyan, Jwala Dhamala, Anoop Kumar, Rahul Gupta
- Venues:
- TrustNLP | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 202–207
- URL:
- https://aclanthology.org/2024.trustnlp-1.17
- DOI:
- 10.18653/v1/2024.trustnlp-1.17
- Cite (ACL):
- Chongyang Gao, Kang Gu, Soroush Vosoughi, and Shagufta Mehnaz. 2024. Semantic-Preserving Adversarial Example Attack against BERT. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pages 202–207, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Semantic-Preserving Adversarial Example Attack against BERT (Gao et al., TrustNLP-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.trustnlp-1.17.pdf