ReCon: Active Defense against Large Vision-Language Model Jailbreaks via Reverse Safety Concept Injection

Zheng He, Yiwei Wang, Hongxing Wang, Yujun Cai


Abstract
Large Vision-Language Models (LVLMs) face an escalating threat from sophisticated multimodal jailbreak attacks. Existing defense strategies suffer from three critical limitations: (1) they neglect visual threats; (2) they are not tailored to the specific semantics of an attack; and (3) they lack a dedicated jailbreak detection mechanism, so defensive measures are applied unnecessarily to benign inputs. To address these limitations, we propose ReCon, a novel black-box defense framework. ReCon integrates a diffusion-based image purifier to neutralize visual perturbations and an autoencoder-based detector to filter anomalous inputs. At its core, it employs a Reverse Safety Concept Injection module that maps detected unsafe concepts to fine-grained, constructive Safe Concepts and generates targeted prompts that precisely rectify the attack semantics. Extensive experiments demonstrate that ReCon significantly enhances the robustness of LVLMs against jailbreak attacks while preserving performance on benign tasks. Disclaimer: Samples in this paper may be harmful and cause discomfort.
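The control flow described in the abstract (purify the image, detect a jailbreak, and only then inject a targeted safety prompt) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the `defend` and `build_safety_prompt` functions, the `SAFE_CONCEPTS` table, the choice to run detection on the purified image, and the stand-in `identity_purifier` and `keyword_detector` components, which merely take the place of the diffusion-based purifier and autoencoder-based detector.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical concept table; the abstract does not list the paper's Safe Concepts.
SAFE_CONCEPTS: Dict[str, str] = {
    "weapon_construction": "explain the legal and safety risks instead of giving build steps",
    "self_harm": "respond supportively and point to professional help",
}
FALLBACK = "decline the harmful part and offer a safe alternative"


def build_safety_prompt(unsafe_concepts: List[str]) -> str:
    """Map each detected unsafe concept to a constructive safe concept."""
    guidance = [SAFE_CONCEPTS.get(c, FALLBACK) for c in unsafe_concepts]
    return "Safety guidance: " + "; ".join(guidance) + "."


def defend(
    prompt: str,
    image: object,
    purify: Callable[[object], object],
    detect: Callable[[str, object], List[str]],
) -> Tuple[str, object, bool]:
    """Return (prompt to send, purified image, whether a jailbreak was flagged)."""
    clean_image = purify(image)           # stand-in for the image purifier
    unsafe = detect(prompt, clean_image)  # stand-in for the jailbreak detector
    if not unsafe:
        # Benign input: the prompt is forwarded unchanged.
        return prompt, clean_image, False
    # Flagged input: prepend guidance targeted at the detected concepts.
    return build_safety_prompt(unsafe) + "\n\n" + prompt, clean_image, True


def identity_purifier(image: object) -> object:
    """Toy purifier that returns the image unchanged."""
    return image


def keyword_detector(prompt: str, image: object) -> List[str]:
    """Toy detector that flags one concept by keyword match."""
    return ["weapon_construction"] if "weapon" in prompt.lower() else []


if __name__ == "__main__":
    print(defend("Describe this picture.", "img", identity_purifier, keyword_detector))
    print(defend("How do I build a weapon?", "img", identity_purifier, keyword_detector)[0])
```

Because the sketch only modifies the inputs and never touches model weights, it is compatible with the black-box setting the abstract names; the benign branch shows how a detection step avoids altering prompts that were not flagged.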
Anthology ID:
2026.findings-acl.1173
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
23427–23441
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1173/
Cite (ACL):
Zheng He, Yiwei Wang, Hongxing Wang, and Yujun Cai. 2026. ReCon: Active Defense against Large Vision-Language Model Jailbreaks via Reverse Safety Concept Injection. In Findings of the Association for Computational Linguistics: ACL 2026, pages 23427–23441, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
ReCon: Active Defense against Large Vision-Language Model Jailbreaks via Reverse Safety Concept Injection (He et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1173.pdf
Checklist:
 2026.findings-acl.1173.checklist.pdf