(Almost) Free Modality Stitching of Foundation Models
Jaisidh Singh, Diganta Misra, Boris Knyazev, Antonio Orvieto
Abstract
Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching process is performed by training a connector module that aims to align the representation spaces of these uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, the task of uni-modal model selection and subsequent connector module training becomes computationally demanding. To address this critical yet under-studied problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter prediction capability of a hypernetwork to obtain jointly trained connector modules for N × M combinations of uni-modal models. In our experiments, Hyma reduces the cost of searching for the best-performing uni-modal model pair by 10×, while matching the ranking and trained connector performance obtained via grid search across a suite of diverse multi-modal benchmarks.
- Anthology ID:
- 2025.emnlp-main.1001
- Volume:
- Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19784–19800
- URL:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1001/
- Cite (ACL):
- Jaisidh Singh, Diganta Misra, Boris Knyazev, and Antonio Orvieto. 2025. (Almost) Free Modality Stitching of Foundation Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19784–19800, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- (Almost) Free Modality Stitching of Foundation Models (Singh et al., EMNLP 2025)
- PDF:
- https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1001.pdf
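
The abstract describes a hypernetwork that predicts connector parameters for N × M combinations of uni-modal models. The sketch below is only a rough illustration of that general idea, not the paper's implementation. It assumes a linear connector, a one-layer hypernetwork, one embedding per model pair, and identical feature sizes across encoders. All names and sizes (`pair_embeddings`, `hyper_W`, `predict_connector`, `connect`, `D_IMG`, `D_TXT`, `E`) are hypothetical, the weights are random, and no training is performed.

```python
import random

random.seed(0)

# Hypothetical sizes: N image encoders, M text encoders, feature dims, embedding dim.
N, M = 3, 2
D_IMG, D_TXT = 4, 3
E = 2


def rand_matrix(rows, cols, scale=1.0):
    """Build a rows x cols matrix of Gaussian samples as a list of lists."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def vec_times_matrix(vec, mat):
    """Compute vec @ mat for a list vector and a list-of-lists matrix."""
    return [sum(v * row[c] for v, row in zip(vec, mat)) for c in range(len(mat[0]))]


# Assumption: one embedding per (image model, text model) pair.
pair_embeddings = {
    (i, j): [random.gauss(0.0, 1.0) for _ in range(E)]
    for i in range(N)
    for j in range(M)
}

# Assumption: the hypernetwork is a single linear map from a pair embedding
# to the flattened weights and bias of a linear connector.
N_PARAMS = D_IMG * D_TXT + D_TXT
hyper_W = rand_matrix(E, N_PARAMS, scale=0.1)


def predict_connector(i, j):
    """Return (weights, bias) of the connector predicted for pair (i, j)."""
    flat = vec_times_matrix(pair_embeddings[(i, j)], hyper_W)
    weights = [flat[r * D_TXT:(r + 1) * D_TXT] for r in range(D_IMG)]
    bias = flat[D_IMG * D_TXT:]
    return weights, bias


def connect(image_feature, i, j):
    """Project one image feature vector into the text space for pair (i, j)."""
    weights, bias = predict_connector(i, j)
    projected = vec_times_matrix(image_feature, weights)
    return [p + b for p, b in zip(projected, bias)]


feature = [random.gauss(0.0, 1.0) for _ in range(D_IMG)]
out = connect(feature, i=1, j=0)
print(len(out))  # one value per text-space dimension
```

In this toy setup a single set of hypernetwork weights serves all N × M pairs, so adding a pair only adds one small embedding rather than a separately trained connector.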