Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

San Kim, Gary Lee


Abstract
Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets—often collected from human or web sources—makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) Defensive Poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) Backdoor Neutralization, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.
Anthology ID:
2026.acl-long.1113
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
24269–24287
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1113/
DOI:
Bibkey:
Cite (ACL):
San Kim and Gary Lee. 2026. Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24269–24287, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models (Kim & Lee, ACL 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1113.pdf
Checklist:
 2026.acl-long.1113.checklist.pdf