2025
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu | Zhuohao Ni | Nicholas Crispino | Zihao Yu | Michael Bendersky | Beliz Gunel | Ruoxi Jia | Xin Liu | Lingjuan Lyu | Dawn Song | Chenguang Wang
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
We present a novel visual instruction tuning strategy that improves the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work rarely examines the importance of each modality in the instruction tuning stage, typically using a majority of vision-language data while keeping text-only data limited and the mixture of modalities fixed. By incorporating diverse text-only data into the visual instruction tuning stage, we vary the amount of vision-language data in controlled experiments to investigate the importance of modality in visual instruction tuning. Our comprehensive evaluation shows that our text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as little as half the total training tokens. We find that simply increasing sufficiently diverse text-only data transfers instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.