@inproceedings{lan-etal-2025-phi,
title = "Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time",
author = "Lan, Yifan and
Cao, Yuanpu and
Zhang, Weitong and
Lin, Lu and
Chen, Jinghui",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.901/",
pages = "17851--17876",
ISBN = "979-8-89176-332-6",
abstract = "Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, \textbf{P}reference \textbf{Hi}jacking (\textbf{Phi}), for manipulating MLLM response preferences using a preference-hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi."
}