SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Juan Ren, Mark Dras, Usman Naseem


Abstract
Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, and Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirections without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types—serving as a practical safety patch for both weakly and strongly aligned LVLMs.
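The abstract describes a classify-then-act preprocessing pipeline: a fine-grained safety classifier labels the multimodal input, the label selects an action (Block, Reframe, or Forward), and category-specific guidance is composed into the prompt before the LVLM sees it. The sketch below illustrates that routing logic only; all names (Action, CATEGORY_GUIDANCE, classify, lvlm_generate) and the placeholder categories are illustrative assumptions, not the paper's actual taxonomy or API.

```python
# Minimal sketch of classifier-guided preprocessing in the spirit of SHIELD.
# Assumptions: the taxonomy, guidance strings, and callables are hypothetical.
from enum import Enum
from typing import Callable

class Action(Enum):
    BLOCK = "block"       # refuse outright
    REFRAME = "reframe"   # answer only the safe portion, with guidance
    FORWARD = "forward"   # pass the query through unchanged

# Category-specific guidance; a real system would cover a fine-grained taxonomy
# rather than these placeholder categories.
CATEGORY_GUIDANCE = {
    "illegal_activity": (Action.BLOCK,
        "Refuse: the request seeks operational detail for illegal activity."),
    "dual_use_info": (Action.REFRAME,
        "Answer at a high level only; omit actionable harmful specifics."),
    "benign": (Action.FORWARD, ""),
}

def shield_preprocess(image, text: str,
                      classify: Callable[[object, str], str],
                      lvlm_generate: Callable[[object, str], str]) -> str:
    """Route a multimodal query: classify -> choose action -> compose -> generate."""
    category = classify(image, text)                  # fine-grained safety label
    action, guidance = CATEGORY_GUIDANCE.get(category, (Action.FORWARD, ""))

    if action is Action.BLOCK:
        return "I can't help with that request."      # placeholder refusal template
    if action is Action.REFRAME:
        # Prepend tailored safety guidance so the LVLM answers safely.
        text = f"[Safety guidance: {guidance}]\n{text}"
    return lvlm_generate(image, text)                 # forward or reframed query
```

Because the preprocessing only wraps the prompt, the same routing can sit in front of any LVLM without retraining, which is the plug-and-play property the abstract claims.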
Anthology ID:
2025.alta-main.6
Volume:
Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association
Month:
November
Year:
2025
Address:
Sydney, Australia
Editors:
Jonathan K. Kummerfeld, Aditya Joshi, Mark Dras
Venue:
ALTA
Publisher:
Association for Computational Linguistics
Pages:
76–89
URL:
https://preview.aclanthology.org/ingest-alta/2025.alta-main.6/
Cite (ACL):
Juan Ren, Mark Dras, and Usman Naseem. 2025. SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs. In Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association, pages 76–89, Sydney, Australia. Association for Computational Linguistics.
Cite (Informal):
SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs (Ren et al., ALTA 2025)
PDF:
https://preview.aclanthology.org/ingest-alta/2025.alta-main.6.pdf