Abstract
In textual backdoor attacks, attackers insert poisoned samples with triggered inputs and target labels into training datasets to manipulate model behavior, threatening the model’s security and reliability. Current defense methods can generally be categorized into inference-time and training-time ones. The former often requires a part of clean samples to set detection thresholds, which may be hard to obtain in practical application scenarios, while the latter usually requires an additional retraining or unlearning process to get a clean model, significantly increasing training costs. To avoid these drawbacks, we focus on developing a practical defense method before model training without using any clean samples. Our analysis reveals that with the help of a pre-trained language model (PLM), poisoned samples, different from clean ones, exhibit mismatched relationship and shared characteristics. Based on these observations, we further propose a two-stage poison detection strategy solely leveraging insights from PLM before model training. Extensive experiments confirm our approach’s effectiveness, achieving better performance than current leading methods more swiftly. Our code is available at https://github.com/Ascian/PKAD.- Anthology ID:
- 2024.findings-emnlp.335
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2024
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 5837–5849
- Language:
- URL:
- https://aclanthology.org/2024.findings-emnlp.335
- DOI:
- 10.18653/v1/2024.findings-emnlp.335
- Cite (ACL):
- Yu Chen, Qi Cao, Kaike Zhang, Xuchao Liu, and Huawei Shen. 2024. PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5837–5849, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks (Chen et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2024.findings-emnlp.335.pdf