ViFT: Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

Zikang Liu, Kun Zhou, Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen


Abstract
Visual instruction tuning has become the predominant approach for eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite its success, because visual instructions require images as input, they leave a gap in inheriting the task-solving capabilities of the backbone LLM and make it costly to collect a large-scale, high-quality dataset. To address this, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. ViFT requires only text-only instructions and image caption data during training, which are used to separately learn task-solving and visual perception abilities. During inference, we extract and combine the representations of the text and image inputs, fusing the two abilities to fulfill multimodal tasks. Experimental results demonstrate that ViFT achieves state-of-the-art performance on several downstream benchmarks with considerably less training data. Our code and data will be publicly released.
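To illustrate the inference-time idea sketched in the abstract (combining representations from a text-only pass and an image-conditioned pass), here is a minimal, hypothetical Python sketch. The DummyLVLM module, the fusion weight alpha, and the simple weighted-sum fusion are assumptions for illustration only, not the authors' actual algorithm.

```python
# Hypothetical sketch: fuse hidden representations from a text-only forward
# pass (task-solving ability) and an image+text forward pass (visual
# perception). Not the ViFT implementation; a toy illustration of the idea.
import torch
import torch.nn as nn


class DummyLVLM(nn.Module):
    """Toy stand-in for an LVLM backbone that returns hidden states."""

    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.text_encoder = nn.Linear(hidden_dim, hidden_dim)
        self.vision_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, text_emb, image_emb=None):
        h = self.text_encoder(text_emb)
        if image_emb is not None:
            # Prepend projected visual features, as LVLMs commonly do.
            h = torch.cat([self.vision_proj(image_emb), h], dim=1)
        return h  # (batch, seq_len, hidden_dim)


def fuse_representations(h_text, h_visual, alpha: float = 0.5):
    """Weighted fusion of the two ability-specific representations.

    Only the overlapping (text) positions are mixed; the extra visual
    positions from the image-conditioned pass are kept unchanged.
    The value of alpha here is an arbitrary assumption.
    """
    n_extra = h_visual.size(1) - h_text.size(1)
    fused_text_part = alpha * h_visual[:, n_extra:, :] + (1 - alpha) * h_text
    return torch.cat([h_visual[:, :n_extra, :], fused_text_part], dim=1)


if __name__ == "__main__":
    model = DummyLVLM()
    text_emb = torch.randn(1, 10, 64)   # embedded instruction tokens
    image_emb = torch.randn(1, 5, 64)   # embedded image patches

    h_text = model(text_emb)               # text-only pass (task-solving)
    h_visual = model(text_emb, image_emb)  # image-conditioned pass (perception)
    fused = fuse_representations(h_text, h_visual, alpha=0.5)
    print(fused.shape)  # torch.Size([1, 15, 64])
```

The fused hidden states would then feed the decoding stage; how ViFT actually extracts and combines representations is detailed in the paper itself.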
Anthology ID:
2025.findings-emnlp.547
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10341–10366
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.547/
DOI:
10.18653/v1/2025.findings-emnlp.547
Cite (ACL):
Zikang Liu, Kun Zhou, Xin Zhao, Dawei Gao, Yaliang Li, and Ji-Rong Wen. 2025. ViFT: Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10341–10366, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
ViFT: Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models (Liu et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.547.pdf
Checklist:
2025.findings-emnlp.547.checklist.pdf