Yaqing Shi
2026
VFA: Empowering Multilingual MLLMs via Vision-Free Adaptation
Yixia Li | Yaqing Shi | Zhiwen Ruan | Dongdong Zhang | Lingjie Jiang | Shaohan Huang | Yun Chen | Guanhua Chen | Furu Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal large language models have advanced rapidly, yet most remain English-centric, as scaling multilingual multimodal instruction tuning is limited by the scarcity and high cost of high-quality non-English image–text supervision. Although multilingual text data is abundant, naive textual fine-tuning can disrupt vision–language alignment and induce catastrophic forgetting. We propose Vision-Free Adaptation (VFA), a framework that decouples multilingual language enhancement from visual alignment by composing complementary task vectors over a shared LLM backbone. Specifically, we fine-tune a base LLM on multilingual text data to derive a multilingual task vector, which is then merged with the vision-aligned task vector of an MLLM. Experiments on five MLLMs across six multilingual multimodal benchmarks show consistent improvements while preserving both general multimodal and text-only capabilities. Moreover, VFA attains competitive performance with a fully multimodally trained model using less than 2% of the text data, demonstrating its efficiency and effectiveness.
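The task-vector composition described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the merging rule, so the plain linear addition, the scaling coefficient `lam`, and the helper names `task_vector` and `merge` below are assumptions made for illustration, not the authors' implementation. Weights are represented as toy dictionaries of flat lists standing in for the shared LLM backbone parameters.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Element-wise difference (finetuned - base) for every backbone parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, vision_tv: Weights, multilingual_tv: Weights,
          lam: float = 1.0) -> Weights:
    """Add both task vectors to the base weights.

    Assumption: simple linear addition, with `lam` (hypothetical) scaling
    the multilingual task vector. The paper may use a different rule.
    """
    return {
        name: [b + v + lam * m
               for b, v, m in zip(base[name], vision_tv[name], multilingual_tv[name])]
        for name in base
    }


# Hypothetical two-parameter backbone, for illustration only.
base_llm = {"w": [1.0, 2.0]}
multilingual_llm = {"w": [1.5, 2.0]}   # base LLM fine-tuned on multilingual text
mllm_backbone = {"w": [1.0, 3.0]}      # LLM weights inside the vision-aligned MLLM

tau_multilingual = task_vector(multilingual_llm, base_llm)
tau_vision = task_vector(mllm_backbone, base_llm)
merged_backbone = merge(base_llm, tau_vision, tau_multilingual, lam=1.0)
```

Under these assumptions no image data is touched at any point: the multilingual task vector comes from text-only fine-tuning, and only the backbone parameters shared by the base LLM and the MLLM are combined.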