Empowering Reliable Visual-Centric Instruction Following in MLLMs

Weilei He, Feng Ju, Zhiyuan Fan, Rui Min, Minhao Cheng


Abstract
Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. However, existing benchmarks for MLLMs’ instruction-following capability focus primarily on verbal instructions in the textual modality. This focus hinders a thorough analysis of instruction-following capabilities, because it overlooks the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs’ instruction-following ability under multimodal settings. Our benchmark incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.
Anthology ID:
2026.findings-acl.461
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9460–9482
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.461/
Cite (ACL):
Weilei He, Feng Ju, Zhiyuan Fan, Rui Min, and Minhao Cheng. 2026. Empowering Reliable Visual-Centric Instruction Following in MLLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 9460–9482, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Empowering Reliable Visual-Centric Instruction Following in MLLMs (He et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.461.pdf
Checklist:
 2026.findings-acl.461.checklist.pdf