Empowering Reliable Visual-Centric Instruction Following in MLLMs

Weilei He; Feng Ju; Zhiyuan Fan; Rui Min; Minhao Cheng

Abstract

Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs’ instruction-following capability primarily focus on verbal instructions in the textual modality. This limitation hinders a thorough analysis of instruction-following capabilities, as such benchmarks overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs’ instruction-following ability under multimodal settings. Our benchmark incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.

Anthology ID: 2026.findings-acl.461
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 9460–9482
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.461/
Cite (ACL): Weilei He, Feng Ju, Zhiyuan Fan, Rui Min, and Minhao Cheng. 2026. Empowering Reliable Visual-Centric Instruction Following in MLLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 9460–9482, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Empowering Reliable Visual-Centric Instruction Following in MLLMs (He et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.461.pdf
Checklist: 2026.findings-acl.461.checklist.pdf
