Kenneth A. Loparo
2026
JW-SVD: Bridging the Cross-Modal Mismatch in Post-Training MLLM Compression
Runchao Li | Yao Fu | Mu Sheng | Haotian Yu | Xianxuan Long | Kenneth A. Loparo
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Post-training compression of multimodal LLMs (MLLMs) faces a fundamental geometric conflict: parameter subspaces optimized for text often suppress orthogonal visual features. We demonstrate that standard SVD fails to resolve this cross-modal mismatch, causing catastrophic visual degradation. To bridge this gap, we introduce Joint-Whitening SVD (JW-SVD), a dual-objective framework that aligns vision and language manifolds via a Joint Covariance basis, preserving features critical to both. We also propose Global Spectrum-Aware Truncation, which dynamically transfers parameter budget from the redundant Vision Tower to the sensitive Backbone. Experiments on Qwen2.5-VL and Llama-3-Next show that JW-SVD retains both text and image capabilities better than baselines. It also resolves the modality trade-off: it recovers over 30% of the perceptual performance lost by baselines while maintaining parity in textual reasoning, enabling robust multimodal performance even at extreme compression rates.
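The abstract does not spell out how the Joint Covariance basis is built, so the following is only an illustrative sketch of generic whitening-based truncated SVD, not the authors' implementation. It assumes the joint covariance is a convex mix of text and vision activation covariances; the function name `joint_whitened_svd`, the mixing weight `alpha`, and the regularizer `eps` are hypothetical choices made for illustration.

```python
import numpy as np

def joint_whitened_svd(W, X_text, X_vis, rank, alpha=0.5, eps=1e-6):
    """Low-rank factorization of W under a mixed activation covariance.

    W:       (d_out, d_in) weight matrix of a linear layer.
    X_text:  (n_text, d_in) calibration activations from text tokens.
    X_vis:   (n_vis, d_in) calibration activations from visual tokens.
    alpha:   ASSUMED mixing weight between the two covariances.
    Returns A (d_out, rank) and B (rank, d_in) with W approximately A @ B.
    """
    d_in = W.shape[1]
    # Second-moment (uncentered covariance) matrices per modality.
    C_text = X_text.T @ X_text / X_text.shape[0]
    C_vis = X_vis.T @ X_vis / X_vis.shape[0]
    # ASSUMPTION: joint covariance as a convex combination, regularized
    # so that the Cholesky factorization is well defined.
    C = alpha * C_text + (1.0 - alpha) * C_vis + eps * np.eye(d_in)
    S = np.linalg.cholesky(C)  # C = S @ S.T, S lower triangular
    # Truncating the SVD of W @ S minimizes the output error measured
    # under C (Eckart-Young applied in the whitened space).
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]
    # B = Vt[:rank] @ inv(S), computed with a solve instead of an inverse.
    B = np.linalg.solve(S.T, Vt[:rank].T).T
    return A, B
```

Under these assumptions, replacing `W` with the pair `(A, B)` stores `rank * (d_out + d_in)` parameters instead of `d_out * d_in`, and the truncation error is weighted by activations from both modalities rather than by text alone.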