Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

Cong Wan, Ying He, Zhongzhan Huang, Hefeng Wu


Abstract
Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.
Anthology ID:
2026.findings-acl.383
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7751–7767
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.383/
Cite (ACL):
Cong Wan, Ying He, Zhongzhan Huang, and Hefeng Wu. 2026. Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 7751–7767, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning (Wan et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.383.pdf
Checklist:
2026.findings-acl.383.checklist.pdf