Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models

Renjie Pi, Kehao Miao, Li Peihang, Runtao Liu, Jiahui Gao, Jipeng Zhang, Xiaofang Zhou


Abstract
Multimodal large language models (MLLMs) have demonstrated extraordinary capabilities in conducting conversations based on image inputs. However, we observe that MLLMs exhibit a pronounced form of visual sycophantic behavior. While similar behavior has also been noted in text-based large language models (LLMs), it becomes significantly more prominent when MLLMs process image inputs. We refer to this phenomenon as the “sycophantic modality gap.” To better understand this issue, we further analyze the factors that contribute to the exacerbation of this gap. To mitigate the visual sycophantic behavior, we first experiment with naive supervised fine-tuning to help the MLLM resist misleading instructions from the user. However, we find that this approach also makes the MLLM overly resistant to corrective instructions (i.e., stubborn even if it is wrong). To alleviate this trade-off, we propose Sycophantic Reflective Tuning (SRT), which enables the MLLM to engage in reflective reasoning, allowing it to determine whether a user’s instruction is misleading or corrective before drawing a conclusion. After applying SRT, we observe a significant reduction in sycophantic behavior toward misleading instructions, without resulting in excessive stubbornness when receiving corrective instructions.
Anthology ID:
2025.emnlp-main.1020
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
20177–20191
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1020/
Cite (ACL):
Renjie Pi, Kehao Miao, Li Peihang, Runtao Liu, Jiahui Gao, Jipeng Zhang, and Xiaofang Zhou. 2025. Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20177–20191, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models (Pi et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1020.pdf
Checklist:
2025.emnlp-main.1020.checklist.pdf