Learning Flexible Large Multimodal Models with Arbitrary Modality Combinations

Xinyu Zhao, Kangqi Ni, Jie Peng, Ang Li, Tianlong Chen


Abstract
Multimodal Large Language Models (MLLMs) show strong potential for cross-modal understanding by integrating powerful language models with multimodal encoders. However, extending MLLMs to handle a diverse range of modalities introduces two critical and intertwined challenges: (1) the reliance on fully paired multimodal data, often scarce or costly to acquire across all modalities, and (2) the computational inefficiency from processing numerous modality tokens and requiring substantial model updates for each new modality. To address these challenges, we enable MLLMs to handle missing modalities by generating representations for absent inputs. Furthermore, recognizing that an increasing number of modalities leads to linearly scaling token counts and that lengthy generated sequences can hinder performance, we employ a dual-stage compression mechanism. It first reduces the number of tokens per modality and then condenses information from multiple modalities into a single, compact token sequence. This culminates in Flex-M3, a novel MLLM framework designed for flexible and efficient learning across arbitrary combinations of modalities. Experiments across diverse multimodal benchmarks and backbones demonstrate that Flex-M3 robustly handles varied modality inputs and scales efficiently. Notably, Flex-M3 outperforms its counterpart trained on only full-modality data, with consistent improvements of 2.29%, 3.15%, and 11.01% on the multimodal reasoning tasks NExT-QA, MUSIC-AVQA, and SQA3D, respectively. Moreover, Flex-M3 demonstrates superior robustness during inference, even when a high proportion of modalities is missing from the input samples, showcasing its capacity for complex, data-scarce multimodal applications.
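The dual-stage compression described in the abstract can be illustrated with a minimal NumPy sketch. This is not the Flex-M3 implementation: the abstract does not specify the operators, so average pooling (stage 1), attention pooling with a fixed set of query vectors (stage 2), and a mean-based placeholder for absent modalities are all assumptions chosen for brevity. The function names `pool_tokens`, `fuse_modalities`, and `compress`, and the modality keys used below, are hypothetical.

```python
import numpy as np


def pool_tokens(tokens, k):
    """Stage 1 (assumed operator): average-pool an (n, d) token matrix into k tokens."""
    if tokens.shape[0] < k:
        raise ValueError("need at least k tokens to pool into k groups")
    chunks = np.array_split(tokens, k, axis=0)  # k contiguous, non-empty groups
    return np.stack([chunk.mean(axis=0) for chunk in chunks])  # (k, d)


def fuse_modalities(per_modality, queries):
    """Stage 2 (assumed operator): attention-pool all modality tokens into len(queries) tokens."""
    keys = np.concatenate(per_modality, axis=0)            # (M * k, d)
    scores = queries @ keys.T / np.sqrt(keys.shape[1])     # (q, M * k)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ keys                                   # (q, d)


def compress(inputs, modalities, k, queries):
    """Compress any subset of modalities into one fixed-length token sequence.

    `inputs` maps a modality name to an (n, d) array, or to None when missing.
    A missing modality is replaced by the mean of the available pooled
    representations; this is only a stand-in for a learned generator.
    """
    present = [pool_tokens(inputs[m], k) for m in modalities
               if inputs.get(m) is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    placeholder = np.mean(present, axis=0)                  # (k, d)
    pooled = [pool_tokens(inputs[m], k) if inputs.get(m) is not None else placeholder
              for m in modalities]
    return fuse_modalities(pooled, queries)
```

Under these assumptions, calling `compress` with, say, 32 video tokens, 20 audio tokens, and a missing third modality yields a sequence whose length equals the number of query vectors, so the number of tokens passed to the language model stays constant however many modalities are present.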
Anthology ID:
2026.findings-acl.517
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10664–10678
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.517/
Cite (ACL):
Xinyu Zhao, Kangqi Ni, Jie Peng, Ang Li, and Tianlong Chen. 2026. Learning Flexible Large Multimodal Models with Arbitrary Modality Combinations. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10664–10678, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Learning Flexible Large Multimodal Models with Arbitrary Modality Combinations (Zhao et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.517.pdf
Checklist:
 2026.findings-acl.517.checklist.pdf