MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, Chenguang Wang
Abstract
We present a novel visual instruction tuning strategy that improves the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing the mixture of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary the amount of vision-language data in controlled experiments to investigate the importance of each modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as little as half the total training tokens. We find that simply increasing sufficiently diverse text-only data enables the transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
- Anthology ID:
- 2025.knowllm-1.6
- Volume:
- Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
- Month:
- August
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Yuji Zhang, Canyu Chen, Sha Li, Mor Geva, Chi Han, Xiaozhi Wang, Shangbin Feng, Silin Gao, Isabelle Augenstein, Mohit Bansal, Manling Li, Heng Ji
- Venues:
- KnowLLM | WS
- Association for Computational Linguistics
- Pages:
- 59–74
- URL:
- https://preview.aclanthology.org/landing_page/2025.knowllm-1.6/
- Cite (ACL):
- Jianhong Tu, Zhuohao Ni, Nicholas Crispino, Zihao Yu, Michael Bendersky, Beliz Gunel, Ruoxi Jia, Xin Liu, Lingjuan Lyu, Dawn Song, and Chenguang Wang. 2025. MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models. In Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM), pages 59–74, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models (Tu et al., KnowLLM 2025)
- PDF:
- https://preview.aclanthology.org/landing_page/2025.knowllm-1.6.pdf