Manli Shu


2025

LATTE: Learning to Think with Vision Specialists
Zixian Ma | Jianguo Zhang | Zhiwei Liu | Jieyu Zhang | Juntao Tan | Manli Shu | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Caiming Xiong | Ranjay Krishna | Silvio Savarese
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

While open-source vision-language models perform well on simple question-answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over the perceptual outputs of vision specialists. Trained on this data, LATTE achieves significant gains of 4-5% over baselines across 6 benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of thoughts.