Semantic-Preserving Adversarial Example Attack against BERT

Chongyang Gao, Kang Gu, Soroush Vosoughi, Shagufta Mehnaz


Abstract
Adversarial example attacks against textual data have been drawing increasing attention in both the natural language processing (NLP) and security domains. However, most existing attacks overlook the importance of semantic similarity and yield easily recognizable adversarial samples. As a result, the defense methods developed in response to these attacks remain vulnerable and could be evaded by advanced adversarial examples that maintain high semantic similarity with the original, non-adversarial text. Hence, this paper investigates the extent to which textual adversarial examples can maintain such high semantic similarity. We propose the Reinforce attack, a reinforcement learning-based framework that generates adversarial text preserving high semantic similarity with the original text. In particular, the attack process is controlled by a reward function rather than by heuristics, as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our generated adversarial texts preserve significantly higher semantic similarity than state-of-the-art attacks while achieving similar (and at times better) attack success rates, thus uncovering novel challenges for effective defenses.
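The abstract describes steering the attack with a reward that trades off semantic similarity against query cost. As a rough illustration of that idea (not the paper's actual reward function; all names and weights below are assumptions), such a reward might combine a similarity score, a success bonus, and a per-query penalty:

```python
# Illustrative sketch of a reward of the kind the abstract describes:
# favor adversarial candidates that flip the victim model's label while
# staying semantically close to the original text and using few queries.
# The weights and structure are hypothetical, not the paper's formulation.

def reward(similarity: float, label_flipped: bool, num_queries: int,
           sim_weight: float = 1.0, success_bonus: float = 1.0,
           query_cost: float = 0.01) -> float:
    """Combine semantic similarity (in [0, 1]), attack success, and query budget."""
    r = sim_weight * similarity       # encourage high semantic similarity
    if label_flipped:
        r += success_bonus            # bonus for a successful attack
    r -= query_cost * num_queries     # penalize query-expensive attacks
    return r
```

Under this sketch, a successful candidate with similarity 0.9 found in 10 queries scores higher than an unsuccessful one with the same similarity and cost, which is the incentive structure the abstract attributes to reward-guided (rather than heuristic) search.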
Anthology ID:
2024.trustnlp-1.17
Volume:
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kai-Wei Chang, Anaelia Ovalle, Jieyu Zhao, Yang Trista Cao, Ninareh Mehrabi, Aram Galstyan, Jwala Dhamala, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
202–207
URL:
https://aclanthology.org/2024.trustnlp-1.17
Cite (ACL):
Chongyang Gao, Kang Gu, Soroush Vosoughi, and Shagufta Mehnaz. 2024. Semantic-Preserving Adversarial Example Attack against BERT. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024), pages 202–207, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Semantic-Preserving Adversarial Example Attack against BERT (Gao et al., TrustNLP-WS 2024)
PDF:
https://preview.aclanthology.org/jeptaln-2024-ingestion/2024.trustnlp-1.17.pdf