Jailbreaking Multimodal Large Language Models using Multi-Clip Video

Choongwon Kang; Seungjong Sun; Hyunmin Jun; Jang Hyun Kim

Abstract

As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality. Warning: This paper may contain potentially offensive content.

Anthology ID: 2026.acl-long.1186
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 25863–25889
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1186/
Cite (ACL): Choongwon Kang, Seungjong Sun, Hyunmin Jun, and Jang Hyun Kim. 2026. Jailbreaking Multimodal Large Language Models using Multi-Clip Video. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25863–25889, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Jailbreaking Multimodal Large Language Models using Multi-Clip Video (Kang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1186.pdf
Checklist: 2026.acl-long.1186.checklist.pdf
