Can Textual Unlearning Solve Cross-Modality Safety Alignment?

Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, Chengyu Song


Abstract
Recent studies reveal that integrating new modalities into large language models (LLMs), such as vision-language models (VLMs), creates a new attack surface that bypasses existing safety training techniques like supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). While further SFT- and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where all input modalities are ultimately fused into the language space, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our empirical evaluation across seven datasets demonstrates promising transferability — textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8%, and in some cases to nearly 2%, for both text-based and vision-text-based attacks, while preserving model utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no additional benefit but incurs significantly higher computational demands.
Anthology ID:
2024.findings-emnlp.574
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9830–9844
URL:
https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.574/
DOI:
10.18653/v1/2024.findings-emnlp.574
Cite (ACL):
Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, and Chengyu Song. 2024. Can Textual Unlearning Solve Cross-Modality Safety Alignment?. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9830–9844, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Can Textual Unlearning Solve Cross-Modality Safety Alignment? (Chakraborty et al., Findings 2024)
PDF:
https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.574.pdf