Representation Potentials of Foundation Models for Multimodal Alignment: A Survey

Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, Yun Fu
Abstract
Foundation models learn highly transferable representations through large-scale pretraining on diverse data. An increasing body of research indicates that these representations exhibit a remarkable degree of similarity across architectures and modalities. In this survey, we investigate the representation potentials of foundation models, defined as the latent capacity of their learned representations to capture task-specific information within a single modality while also providing a transferable basis for alignment and unification across modalities. We begin by reviewing representative foundation models and the key metrics that make alignment measurable. We then synthesize empirical evidence of representation potentials from studies in vision, language, speech, multimodality, and neuroscience. The evidence suggests that foundation models often exhibit structural regularities and semantic consistencies in their representation spaces, positioning them as strong candidates for cross-modal transfer and alignment. We further analyze the key factors that foster representation potentials, discuss open questions, and highlight potential challenges.
Anthology ID:
2025.emnlp-main.843
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
16680–16695
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.843/
Cite (ACL):
Jianglin Lu, Hailing Wang, Yi Xu, Yizhou Wang, Kuo Yang, and Yun Fu. 2025. Representation Potentials of Foundation Models for Multimodal Alignment: A Survey. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16680–16695, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Representation Potentials of Foundation Models for Multimodal Alignment: A Survey (Lu et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.843.pdf
Checklist:
2025.emnlp-main.843.checklist.pdf