Xintong Zhang


2025

On Domain-Adaptive Post-Training for Multimodal Large Language Models
Daixuan Cheng | Shaohan Huang | Ziyu Zhu | Xintong Zhang | Xin Zhao | Zhongzhi Luan | Bo Dai | Zhenliang Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025

Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is highly significant in promoting their practical applications. This paper systematically investigates domain adaptation of MLLMs via post-training, focusing on data synthesis, training pipeline, and task evaluation. (1) **Data Synthesis**: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks based on domain-specific image-caption pairs. The resulting data surpass the data synthesized by manual rules or strong closed-source models in enhancing domain-specific performance. (2) **Training Pipeline**: Unlike general MLLMs that typically adopt a two-stage training paradigm, we find that a single-stage approach is more effective for domain adaptation. (3) **Task Evaluation**: We conduct extensive experiments in high-impact domains such as biomedicine, food, and remote sensing, by post-training a variety of MLLMs and then evaluating MLLM performance on various domain-specific tasks. Finally, we fully open-source our models, code, and data to encourage future research in this area.