Can Textual Unlearning Solve Cross-Modality Safety Alignment?
Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, Chengyu Song
Abstract
Recent studies reveal that integrating new modalities into large language models (LLMs), such as vision-language models (VLMs), creates a new attack surface that bypasses existing safety training techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). While further SFT- and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, in which all input modalities are ultimately fused into the language space, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our empirical evaluation across seven datasets demonstrates promising transferability: textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving model utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no additional benefit but incurs significantly higher computational demands.
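The abstract does not spell out the unlearning objective, but a common recipe for textual unlearning in this setting is gradient ascent on harmful completions, applied to the language backbone alone with text-only data. Below is a minimal sketch under that assumption; the model name, hyperparameters, and the `unlearn_step` helper are illustrative placeholders, not the paper's actual method.

```python
# Sketch of textual unlearning via gradient ascent on a causal LM.
# Assumption: the VLM's language backbone is unlearned with text-only
# (prompt, harmful answer) pairs; no image encoder is involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def unlearn_step(prompt: str, harmful_answer: str) -> float:
    """One step of gradient ascent on a harmful (prompt, answer) pair.

    Only the answer tokens contribute to the loss; negating the loss
    turns the usual descent update into ascent, pushing probability
    mass away from the harmful continuation.
    """
    enc = tokenizer(prompt + harmful_answer, return_tensors="pt")
    labels = enc["input_ids"].clone()
    prompt_len = len(tokenizer(prompt)["input_ids"])
    labels[:, :prompt_len] = -100  # ignore prompt tokens in the loss
    out = model(**enc, labels=labels)
    (-out.loss).backward()  # ascend instead of descend
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

Because every input modality is fused into this shared language space, ascent on text alone can, per the paper's findings, also blunt vision-text attacks.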
- Anthology ID: 2024.findings-emnlp.574
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 9830–9844
- URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.574/
- DOI: 10.18653/v1/2024.findings-emnlp.574
- Cite (ACL): Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, and Chengyu Song. 2024. Can Textual Unlearning Solve Cross-Modality Safety Alignment?. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9830–9844, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): Can Textual Unlearning Solve Cross-Modality Safety Alignment? (Chakraborty et al., Findings 2024)
- PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.574.pdf