SAFER: A Controllable Safeguard for LLMs against Backdoor Attacks

Zirui Hu, Zheng Zhang, Yingjie Wang, Dacheng Tao


Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of natural language processing (NLP) tasks. However, they remain susceptible to backdoor attacks, where adversaries embed hidden triggers in the input to induce malicious, attacker-specified behaviors. While existing inference-time defenses aim to mitigate such threats by detecting and filtering poisoned inputs, they often lack explicit control over the false acceptance rate (FAR)—a critical requirement in safety-sensitive settings where even rare failures can lead to catastrophic consequences. To address this challenge, we propose SAFER, a novel inference-time defense framework that provides explicit and provable control over FAR without requiring prior knowledge of backdoor samples. SAFER leverages distributional information from available data to estimate the likelihood that an input is clean and selects inputs accordingly. From a theoretical perspective, we demonstrate that SAFER asymptotically guarantees control of the true FAR. Empirical evaluations on three benchmark datasets across diverse backdoor attack scenarios show that SAFER consistently achieves reliable FAR control while maintaining high detection power, significantly outperforming existing inference-time defenses.
Anthology ID:
2026.findings-acl.705
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14380–14398
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.705/
Cite (ACL):
Zirui Hu, Zheng Zhang, Yingjie Wang, and Dacheng Tao. 2026. SAFER: A Controllable Safeguard for LLMs against Backdoor Attacks. In Findings of the Association for Computational Linguistics: ACL 2026, pages 14380–14398, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SAFER: A Controllable Safeguard for LLMs against Backdoor Attacks (Hu et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.705.pdf
Checklist:
2026.findings-acl.705.checklist.pdf