Hongxing Wang

2026

ReCon: Active Defense against Large Vision-Language Model Jailbreaks via Reverse Safety Concept Injection
Zheng He | Yiwei Wang | Hongxing Wang | Yujun Cai
Findings of the Association for Computational Linguistics: ACL 2026

Large Vision-Language Models (LVLMs) confront an escalating threat from sophisticated multimodal jailbreak attacks. However, existing defense strategies suffer from three critical limitations: (1) the neglect of visual threats; (2) a lack of fine-grained handling of specific attack semantics; and (3) the absence of a dedicated jailbreak detection mechanism, which leads to unnecessary defensive measures against benign inputs. To address these limitations, we propose ReCon, a novel black-box defense framework. ReCon integrates a diffusion-based image purifier to neutralize visual perturbations and an autoencoder-based detector for anomaly filtration. At its core, it employs a Reverse Safety Concept Injection module that maps detected unsafe concepts to fine-grained, constructive Safe Concepts, generating targeted prompts to precisely rectify attack semantics. Extensive experiments demonstrate that ReCon significantly enhances the robustness of LVLMs against jailbreak attacks while preserving performance on benign tasks. Disclaimer: Samples in this paper may be harmful and cause discomfort.

