(Almost) Free Modality Stitching of Foundation Models

Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto


Abstract
Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching is performed by training a connector module that aligns the representation spaces of the uni-modal models toward a multi-modal objective. However, given the cost of training such connectors on large-scale web datasets, coupled with the ever-growing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector training becomes computationally demanding. To address this critical yet under-studied problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training that leverages hypernetworks. Specifically, our framework uses the parameter-prediction capability of a hypernetwork to obtain jointly trained connector modules for N × M combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best-performing uni-modal model pair by 10×, while matching both the ranking and the trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
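To make the core idea concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: a hypernetwork that maps a learned embedding of each (image model, text model) pair to the weights of a linear connector, so a single training run covers all N × M pairs. This is not the paper's implementation; the module names, dimensions, and the toy alignment loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConnectorHypernetwork(nn.Module):
    """Maps a learned per-pair embedding to the weights of a linear connector.

    All names and dimensions here are illustrative assumptions, not the
    paper's actual architecture.
    """
    def __init__(self, num_pairs: int, pair_dim: int = 64,
                 d_in: int = 512, d_out: int = 512, hidden: int = 256):
        super().__init__()
        self.d_in, self.d_out = d_in, d_out
        # One learned embedding per (image model, text model) combination.
        self.pair_embed = nn.Embedding(num_pairs, pair_dim)
        # Small MLP emitting the flattened connector weight matrix and bias.
        self.hyper = nn.Sequential(
            nn.Linear(pair_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, d_in * d_out + d_out),
        )

    def forward(self, pair_idx: int, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, d_in) features from one frozen uni-modal encoder.
        params = self.hyper(self.pair_embed(torch.tensor(pair_idx)))
        W = params[: self.d_in * self.d_out].view(self.d_out, self.d_in)
        b = params[self.d_in * self.d_out:]
        return feats @ W.T + b  # connected features, shape (batch, d_out)

# Joint-training sketch: one hypernetwork covers all N x M pairs, so a
# single run replaces N x M separate connector trainings.
hyma = ConnectorHypernetwork(num_pairs=3 * 4)   # e.g. N=3 image, M=4 text models
opt = torch.optim.AdamW(hyma.parameters(), lr=1e-4)
for step in range(1000):
    pair = torch.randint(3 * 4, ()).item()      # sample one model pair per step
    img_feats = torch.randn(8, 512)             # stand-ins for frozen encoder outputs
    txt_feats = torch.randn(8, 512)
    loss = ((hyma(pair, img_feats) - txt_feats) ** 2).mean()  # toy alignment loss
    opt.zero_grad(); loss.backward(); opt.step()
```

In the paper, the connectors are trained against a multi-modal objective with the uni-modal encoders frozen; the mean-squared loss above is only a stand-in for such an alignment objective.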
Anthology ID:
2025.emnlp-main.1001
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19784–19800
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1001/
Cite (ACL):
Jaisidh Singh, Diganta Misra, Boris Knyazev, and Antonio Orvieto. 2025. (Almost) Free Modality Stitching of Foundation Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19784–19800, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
(Almost) Free Modality Stitching of Foundation Models (Singh et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1001.pdf
Checklist:
2025.emnlp-main.1001.checklist.pdf