Kenneth A. Loparo


2026

Post-training compression of multimodal LLMs faces a fundamental geometric conflict: parameter subspaces optimized for text often suppress orthogonal visual features. We demonstrate that standard SVD fails to resolve this cross-modal mismatch, causing catastrophic visual degradation. To bridge this gap, we introduce Joint-Whitening SVD (JW-SVD), a dual-objective framework that aligns the vision and language manifolds via a Joint Covariance basis, preserving features critical to both modalities. We further propose Global Spectrum-Aware Truncation, which dynamically transfers parameter budget from the redundant Vision Tower to the more sensitive Backbone. Experiments on Qwen2.5-VL and Llama-3-Next show that JW-SVD retains both text and image capabilities better than baseline methods. It also resolves the modality trade-off, recovering over 30% of the perceptual performance lost by baselines while maintaining parity in textual reasoning, which enables robust multimodal performance even at extreme compression rates.
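
To make the idea of whitening against a joint covariance concrete, the following is a minimal sketch and not the paper's implementation. It assumes the Joint Covariance is a convex mixture of the text and vision activation covariances, and it uses the standard activation-whitened SVD construction (Cholesky factor of the covariance, truncated SVD of the whitened weight). The function name `joint_whitened_lowrank`, the mixing weight `alpha`, and the ridge term `eps` are all hypothetical and introduced only for illustration.

```python
import numpy as np

def joint_whitened_lowrank(W, X_text, X_vis, rank, alpha=0.5, eps=1e-6):
    """Rank-`rank` factorization W ~= A @ B, whitened by a joint covariance.

    W      : (out_dim, in_dim) weight matrix of a linear layer
    X_text : (in_dim, n_text) calibration activations from text inputs
    X_vis  : (in_dim, n_vis)  calibration activations from image inputs
    alpha  : ASSUMED mixing weight between the two covariances
    eps    : small ridge term so the Cholesky factorization exists
    """
    d = W.shape[1]
    # Per-modality second-moment matrices of the layer inputs.
    C_text = X_text @ X_text.T / X_text.shape[1]
    C_vis = X_vis @ X_vis.T / X_vis.shape[1]
    # ASSUMPTION: the joint covariance is a convex combination of the two.
    C = alpha * C_text + (1.0 - alpha) * C_vis + eps * np.eye(d)

    # Whitening factor: C = S @ S.T with S lower triangular.
    S = np.linalg.cholesky(C)

    # Truncated SVD in the whitened space.
    U, s, Vt = np.linalg.svd(W @ S, full_matrices=False)
    A = U[:, :rank] * s[:rank]                    # (out_dim, rank)
    # B = Vt[:rank] @ inv(S), computed with a solve instead of an inverse.
    B = np.linalg.solve(S.T, Vt[:rank].T).T       # (rank, in_dim)
    return A, B
```

Under this construction, `A @ B` is the rank-constrained matrix that minimizes the output error weighted by the mixed covariance, so directions that matter to either modality are retained; plain SVD of `W` ignores the activations entirely. How the paper actually forms its joint basis, and how Global Spectrum-Aware Truncation chooses `rank` per layer, is not specified in this text.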