From Shortcuts to Triggers: Backdoor Defense with Denoised PoE

Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen


Abstract
Language models are often at risk of diverse backdoor attacks, especially data poisoning. Thus, it is important to investigate defense solutions for addressing them. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers, leaving a universal defense against various backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the shortcuts. To address the label flip caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on three NLP tasks show that DPoE significantly improves the defense performance against various types of backdoor triggers including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of triggers.
Anthology ID:
2024.naacl-long.27
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
483–496
Language:
URL:
https://aclanthology.org/2024.naacl-long.27
DOI:
10.18653/v1/2024.naacl-long.27
Bibkey:
Cite (ACL):
Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. 2024. From Shortcuts to Triggers: Backdoor Defense with Denoised PoE. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 483–496, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
From Shortcuts to Triggers: Backdoor Defense with Denoised PoE (Liu et al., NAACL 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2024.naacl-long.27.pdf