Juan Ren
2025
SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs
Juan Ren | Mark Dras | Usman Naseem
Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association
Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, and Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirections without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types—serving as a practical safety patch for both weakly and strongly aligned LVLMs.
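The classify-then-act flow described in this abstract can be illustrated with a minimal sketch. Only the three action names (Block, Reframe, Forward) come from the abstract. The category names, guidance strings, keyword lookup, and the function names `classify` and `preprocess` are hypothetical stand-ins, and having Block skip the model call entirely is an assumption rather than the paper's confirmed behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    BLOCK = "block"
    REFRAME = "reframe"
    FORWARD = "forward"


@dataclass(frozen=True)
class Policy:
    action: Action
    guidance: str


# Hypothetical categories and guidance text; not taken from the paper.
POLICIES = {
    "weapons": Policy(Action.BLOCK, "Refuse and briefly explain that the request is unsafe."),
    "self_harm": Policy(Action.REFRAME, "Respond with supportive, non-instructional information."),
    "benign": Policy(Action.FORWARD, ""),
}

# Toy keyword lookup standing in for a trained fine-grained safety classifier.
KEYWORDS = {
    "weapons": ("build a bomb", "make a gun"),
    "self_harm": ("hurt myself",),
}


def classify(prompt: str) -> str:
    """Return the first category whose phrases appear in the prompt, else 'benign'."""
    text = prompt.lower()
    for category, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return category
    return "benign"


def preprocess(prompt: str) -> Tuple[Action, Optional[str]]:
    """Return the chosen action and the text to pass to the LVLM (None if blocked)."""
    policy = POLICIES[classify(prompt)]
    if policy.action is Action.BLOCK:
        # Assumption: a blocked prompt never reaches the model.
        return policy.action, None
    if policy.action is Action.REFRAME:
        # Prepend category-specific guidance so the model redirects safely.
        return policy.action, f"[Safety guidance: {policy.guidance}]\n{prompt}"
    return policy.action, prompt
```

Because the step only rewrites or withholds the prompt, it can sit in front of any model without retraining, which matches the plug-and-play framing in the abstract. Adding a new category in this sketch means adding one entry to each of the two dictionaries.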
Alignment of Large Language Models with Human Preferences and Values
Usman Naseem | Gautam Siddharth Kashyap | Kaixuan Ren | Yiran Zhang | Utsav Maskey | Juan Ren | Afrozah Nadeem
Proceedings of The 23rd Annual Workshop of the Australasian Language Technology Association
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their reliability and alignment with human expectations remain unresolved challenges. This tutorial introduces the foundations of alignment and provides participants with a conceptual and practical understanding of the field. Core principles such as values, safety, reasoning, and pluralism will be presented through intuitive explanations, worked examples, and case studies. The aim is to equip attendees with the ability to reason about alignment goals, understand how existing methods operate in practice, and critically evaluate their strengths and limitations.
Co-authors
- Usman Naseem 2
- Mark Dras 1
- Gautam Siddharth Kashyap 1
- Utsav Maskey 1
- Afrozah Nadeem 1
Venues
- alta 2