Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models
Zhengxiao Liu, Bowen Shen, Zheng Lin, Fali Wang, Weiping Wang
Abstract
A pre-trained language model (PLM) can be stealthily misled into producing target outputs by backdoor attacks when it encounters poisoned samples, without performance degradation on clean samples. The stealthiness of backdoor attacks is commonly attained through fine-tuning that minimizes the cross-entropy loss on a union of poisoned and clean samples. Existing defense paradigms provide a workaround by detecting and removing poisoned samples at pre-training or inference time. In contrast, we offer a new perspective in which the backdoor attack is directly reversed. Specifically, a maximum entropy loss is incorporated into training to neutralize the minimal cross-entropy fine-tuning on poisoned data. We defend against a range of backdoor attacks on classification tasks and significantly lower the attack success rate. As an extension, we explore the relationship between intended backdoor attacks and unintended dataset bias, and demonstrate the feasibility of the maximum entropy principle for de-biasing.
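The abstract does not spell out the objective itself. As a rough illustration of the idea it describes, a maximum entropy term can be written as an entropy regularizer added to the task loss. The PyTorch sketch below is our assumption of what such a combined objective might look like, not the paper's exact formulation: the weighting `lam` and the choice of which samples the entropy term applies to are illustrative.

```python
import torch
import torch.nn.functional as F

def max_entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    """Negative mean entropy of the predictive distribution.

    Minimizing this term pushes predictions toward uniform
    (maximum entropy), counteracting the overconfident
    trigger-to-label mapping a backdoor instills.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # per-sample entropy
    return -entropy.mean()  # negated so minimizing maximizes entropy

def defense_objective(logits: torch.Tensor,
                      labels: torch.Tensor,
                      lam: float = 1.0) -> torch.Tensor:
    """Hypothetical combined loss: cross-entropy on the labels plus a
    weighted maximum-entropy term (`lam` is an illustrative knob)."""
    ce = F.cross_entropy(logits, labels)
    return ce + lam * max_entropy_loss(logits)
```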
- Anthology ID: 2023.findings-acl.237
- Volume: Findings of the Association for Computational Linguistics: ACL 2023
- Month: July
- Year: 2023
- Address: Toronto, Canada
- Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 3850–3868
- URL: https://aclanthology.org/2023.findings-acl.237
- DOI: 10.18653/v1/2023.findings-acl.237
- Cite (ACL): Zhengxiao Liu, Bowen Shen, Zheng Lin, Fali Wang, and Weiping Wang. 2023. Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3850–3868, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal): Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models (Liu et al., Findings 2023)
- PDF: https://preview.aclanthology.org/add_acl24_videos/2023.findings-acl.237.pdf