PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs

Zijing Wang, YongKang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, Hinrich Schuetze


Abstract
Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text’s reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework to mitigate this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in multimodal large language models: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results based on five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions.Our repository is on https://github.com/wzj1718/PlaM .
Anthology ID:
2026.findings-acl.1056
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
21019–21035
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1056/
DOI:
Bibkey:
Cite (ACL):
Zijing Wang, YongKang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, and Hinrich Schuetze. 2026. PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 21019–21035, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs (Wang et al., Findings 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1056.pdf
Checklist:
 2026.findings-acl.1056.checklist.pdf