SCOPE: Preserving Modality-Specific Cues to Mitigate Modality Laziness in Multimodal Learning

Jingfan Yang, Rui Zhang, Liang Hong, Ke Yuan


Abstract
Multimodal learning aims to learn unified representations from heterogeneous modalities and supports many natural language processing tasks. However, multimodal models often exhibit modality laziness: they over-rely on a dominant modality and under-exploit complementary signals. Existing approaches typically strengthen unimodal training or rebalance modality contributions, but they may still emphasize shared semantics and overlook modality-specific cues. To address this, we propose SCOPE, a unified framework for learning complete multimodal representations through Shared-and-COmplementary cue PrEservation. First, SCOPE uses a mutual information-guided disentanglement module to separate shared semantics from modality-specific cues and mitigate representation collapse. Second, SCOPE aligns modalities by enforcing structural consistency between modality-wise semantic graphs, avoiding brittle point-wise matching. Finally, SCOPE performs balanced fusion via structure-aware diffusion attention to integrate shared and complementary cues without feature homogenization. Experiments on four benchmark datasets show that SCOPE consistently outperforms state-of-the-art baselines, with accuracy improvements of up to 27.10%.
Anthology ID:
2026.findings-acl.1453
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
29066–29078
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1453/
Cite (ACL):
Jingfan Yang, Rui Zhang, Liang Hong, and Ke Yuan. 2026. SCOPE: Preserving Modality-Specific Cues to Mitigate Modality Laziness in Multimodal Learning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29066–29078, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SCOPE: Preserving Modality-Specific Cues to Mitigate Modality Laziness in Multimodal Learning (Yang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1453.pdf
Checklist:
 2026.findings-acl.1453.checklist.pdf