@inproceedings{sun-etal-2024-crowd,
    title = "{CROWD}: Certified Robustness via Weight Distribution for Smoothed Classifiers against Backdoor Attack",
    author = "Sun, Siqi  and
      Sen, Procheta  and
      Ruan, Wenjie",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2024.findings-emnlp.993/",
    doi = "10.18653/v1/2024.findings-emnlp.993",
    pages = "17056--17070",
    abstract = "Language models are vulnerable to clandestinely modified data and manipulation by attackers. Despite considerable research dedicated to enhancing robustness against adversarial attacks, the realm of provable robustness for backdoor attacks remains relatively unexplored. In this paper, we initiate a pioneering investigation into the certified robustness of NLP models against backdoor triggers. We propose a model-agnostic mechanism for large-scale models that applies to complex model structures without the need for assessing model architecture or internal knowledge. More importantly, we take recent advances in randomized smoothing theory and propose a novel weight-based distribution algorithm to enable semantic similarity and provide theoretical robustness guarantees. Experimentally, we demonstrate the efficacy of our approach across a diverse range of datasets and tasks, highlighting its utility in mitigating backdoor triggers. Our results show strong performance in terms of certified accuracy, scalability, and semantic preservation."
}