Activation Decomposition and Steering for LLM Backdoor Remediation

Lingfeng Zhong, Qiongkai Xu, Usman Naseem


Abstract
Existing defenses against backdoor attacks on large language models (LLMs) rely on either auxiliary models or safety-related datasets, which are not always available. To address this limitation, we propose Contrastive-Selective Activation Decomposition and Steering (CS-ADS), which contrasts relatively more benign and poisoned settings to decompose the feature vectors used for steering, without relying on additional auxiliary models or datasets. With the resulting disentangled vectors for remediation, our method achieves defense quality that can even exceed that of dataset-based contrastive steering strategies. This novel decomposition-based solution is motivated by the key insight that the feature representations of a prompt pair can encode the same benign semantics in different proportions, even when both prompts in the pair are similarly backdoored. Such discrepancies allow our method to identify effective remediation directions for steering the generation process, thereby preventing undesired outputs. We evaluate CS-ADS against multiple state-of-the-art backdoor attacks, and experimental results show that it provides effective defense across settings.
Anthology ID:
2026.acl-long.2025
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
43713–43737
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2025/
Cite (ACL):
Lingfeng Zhong, Qiongkai Xu, and Usman Naseem. 2026. Activation Decomposition and Steering for LLM Backdoor Remediation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 43713–43737, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Activation Decomposition and Steering for LLM Backdoor Remediation (Zhong et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2025.pdf
Checklist:
2026.acl-long.2025.checklist.pdf