Towards Unified Multimodal Large Language Models: A survey

Xu Ma; Yitian Zhang; Yun Fu

Abstract

The recent surge of interest in unified Multimodal Large Language Models (MLLMs) has catalyzed rapid progress toward general-purpose generation and understanding across modalities. Despite these advances, the field lacks a systematic and cohesive framework that connects these developments, revisits their motivations, and situates current trends within a broader landscape. In this survey, we present a comprehensive and in-depth review of unified MLLMs, offering both a methodological taxonomy and unique perspectives on the field. We begin by outlining the foundational concepts and prerequisites for understanding unified MLLMs. We then examine their design from several aspects, including model architectures, loss functions, alignment techniques, and representation strategies. Furthermore, we discuss persistent challenges and identify promising directions for future research. By bridging scattered progress and providing a consolidated view, this survey aims to foster a deeper and more systematic understanding of unified MLLMs and to inspire future innovations in building truly general multimodal intelligence.

Anthology ID: 2026.findings-acl.1853
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 37212–37230
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1853/
Cite (ACL): Xu Ma, Yitian Zhang, and Yun Fu. 2026. Towards Unified Multimodal Large Language Models: A survey. In Findings of the Association for Computational Linguistics: ACL 2026, pages 37212–37230, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Towards Unified Multimodal Large Language Models: A survey (Ma et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1853.pdf
Checklist: 2026.findings-acl.1853.checklist.pdf
