DiffuseDef: Improved Robustness to Adversarial Attacks via Iterative Denoising

Zhenhao Li, Huichi Zhou, Marek Rei, Lucia Specia


Abstract
Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. The diffusion layer is trained on top of the existing classifier, ensuring seamless integration with any model in a plug-and-play manner. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively, and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over existing adversarial defense methods and achieves state-of-the-art performance against common black-box and white-box adversarial attacks.
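
The inference procedure described in the abstract (add sampled noise to the hidden state, denoise iteratively, then ensemble) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Gaussian noise with scale `sigma`, the plain subtraction of predicted noise at each step, and the use of an element-wise mean as the ensembling rule are all assumptions made for illustration, and `toy_predictor` is a hypothetical stand-in for the trained diffusion layer.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical interface: maps (hidden state, timestep) to a noise estimate.
NoisePredictor = Callable[[Sequence[float], int], Vector]


def add_noise(hidden: Sequence[float], sigma: float, rng: random.Random) -> Vector:
    """Combine the hidden state with sampled Gaussian noise (assumed noise model)."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in hidden]


def denoise(hidden: Sequence[float], predict_noise: NoisePredictor, steps: int) -> Vector:
    """Iteratively subtract the predicted noise, counting down from `steps` to 1."""
    current = list(hidden)
    for t in range(steps, 0, -1):
        noise_hat = predict_noise(current, t)
        current = [x - n for x, n in zip(current, noise_hat)]
    return current


def robust_representation(
    hidden: Sequence[float],
    predict_noise: NoisePredictor,
    steps: int = 5,
    ensemble_size: int = 4,
    sigma: float = 0.1,
    seed: int = 0,
) -> Vector:
    """Noise, denoise, and average several copies of the hidden state.

    Averaging is an assumed ensembling rule; the abstract only says the
    denoised states are "ensembled".
    """
    rng = random.Random(seed)
    candidates = [
        denoise(add_noise(hidden, sigma, rng), predict_noise, steps)
        for _ in range(ensemble_size)
    ]
    return [
        sum(candidate[i] for candidate in candidates) / ensemble_size
        for i in range(len(hidden))
    ]


def toy_predictor(hidden: Sequence[float], t: int) -> Vector:
    """Hypothetical stand-in for a trained denoiser: treats 10% of each value as noise."""
    return [0.1 * x for x in hidden]


# Example call: the result would be passed to the classifier in place of the raw hidden state.
representation = robust_representation([1.0, -2.0, 0.5], toy_predictor)
```

In a real system the hidden state would come from the encoder and the noise predictor would be a learned network, so the sketch only conveys the control flow of the three stages.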
Anthology ID:
2025.acl-long.454
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
9259–9274
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.454/
Cite (ACL):
Zhenhao Li, Huichi Zhou, Marek Rei, and Lucia Specia. 2025. DiffuseDef: Improved Robustness to Adversarial Attacks via Iterative Denoising. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9259–9274, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
DiffuseDef: Improved Robustness to Adversarial Attacks via Iterative Denoising (Li et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.454.pdf