Kang Gu
2025
From Insight to Exploit: Leveraging LLM Collaboration for Adaptive Adversarial Text Generation
Najrin Sultana | Md Rafi Ur Rashid | Kang Gu | Shagufta Mehnaz
Findings of the Association for Computational Linguistics: EMNLP 2025
LLMs can provide substantial zero-shot performance on diverse tasks using a simple task prompt, eliminating the need for training or fine-tuning. However, when applying these models to sensitive tasks, it is crucial to thoroughly assess their robustness against adversarial inputs. In this work, we introduce Static Deceptor (StaDec) and Dynamic Deceptor (DyDec), two innovative attack frameworks designed to systematically generate dynamic and adaptive adversarial examples by leveraging the LLMs' own understanding of language. We produce subtle, natural-looking adversarial inputs that preserve semantic similarity to the original text while effectively deceiving the target LLM. By using an automated, LLM-driven pipeline, we eliminate the dependence on external heuristics. Our attacks evolve with advancements in LLMs and demonstrate strong transferability to models unknown to the attacker. Overall, this work provides a systematic approach for self-assessing the robustness of LLMs. We release our code and data at https://github.com/Shukti042/AdversarialExample.
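The sketch below illustrates what such an automated, LLM-driven rewriting loop might look like, assuming hypothetical helpers `rewrite_with_llm` (the attacker LLM), `predict` (the target LLM's zero-shot prediction), and `similarity` (an embedding-based semantic-similarity scorer). It is a minimal illustration of the general idea, not the authors' released implementation (see the repository above for that).

```python
# Hypothetical sketch of an LLM-driven adversarial rewriting loop.
# All helper callables are assumptions supplied by the caller, not part of the paper's code.
from typing import Callable, Optional

def generate_adversarial(
    text: str,
    rewrite_with_llm: Callable[[str, str], str],   # attacker LLM: (text, feedback) -> candidate rewrite
    predict: Callable[[str], str],                 # target LLM's zero-shot prediction
    similarity: Callable[[str, str], float],       # semantic-similarity scorer, e.g. embedding cosine
    original_label: str,
    sim_threshold: float = 0.85,
    max_rounds: int = 10,
) -> Optional[str]:
    """Ask an attacker LLM for subtle rewrites until the target LLM's prediction
    flips while semantic similarity to the original text stays above a threshold."""
    feedback = "Rewrite the text subtly while preserving its meaning."
    for _ in range(max_rounds):
        candidate = rewrite_with_llm(text, feedback)
        if similarity(text, candidate) < sim_threshold:
            feedback = "The last rewrite drifted in meaning; stay closer to the original."
            continue
        if predict(candidate) != original_label:
            return candidate  # prediction flipped: adversarial example found
        feedback = "The target model was not fooled; try a different subtle rewrite."
    return None  # no successful adversarial example within the query budget
```

Because the loop is driven entirely by the attacker LLM's feedback-conditioned rewrites, it needs no hand-crafted perturbation rules, which is what lets such an attack adapt as the underlying models improve.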
2024
Semantic-Preserving Adversarial Example Attack against BERT
Chongyang Gao | Kang Gu | Soroush Vosoughi | Shagufta Mehnaz
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Adversarial example attacks against textual data have been drawing increasing attention in both the natural language processing (NLP) and security domains. However, most existing attacks overlook the importance of semantic similarity and yield easily recognizable adversarial samples. As a result, the defense methods developed in response to these attacks remain vulnerable and could be evaded by advanced adversarial examples that maintain high semantic similarity with the original, non-adversarial text. Hence, this paper investigates the extent to which textual adversarial examples can maintain such high semantic similarity. We propose the Reinforce attack, a reinforcement learning-based framework for generating adversarial text that preserves high semantic similarity with the original text. In particular, the attack process is controlled by a reward function rather than by heuristics, as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our generated adversarial texts preserve significantly higher semantic similarity than state-of-the-art attacks while achieving similar (and at times better) attack success rates, thus uncovering novel challenges for effective defenses.
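The reward-driven control described above can be pictured with a small reward function that trades off semantic similarity, attack success, and query cost. The weights, names, and numbers below are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative reward shaping for an RL-driven text attack (assumed weights, not the paper's).
def reward(
    sim: float,             # semantic similarity between original and perturbed text, in [0, 1]
    attack_succeeded: bool,  # whether the victim model's prediction flipped
    num_queries: int,        # victim-model queries spent so far
    sim_weight: float = 1.0,
    success_bonus: float = 2.0,
    query_penalty: float = 0.01,
) -> float:
    """Reward high semantic similarity and a successful flip; penalize query cost."""
    r = sim_weight * sim - query_penalty * num_queries
    if attack_succeeded:
        r += success_bonus
    return r

# Example: a successful flip at similarity 0.92 after 40 queries.
print(reward(0.92, True, 40))  # 0.92 + 2.0 - 0.4 = 2.52
```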