LATTE: Learning to Think with Vision Specialists
Zixian Ma | Jianguo Zhang | Zhiwei Liu | Jieyu Zhang | Juntao Tan | Manli Shu | Juan Carlos Niebles | Shelby Heinecke | Huan Wang | Caiming Xiong | Ranjay Krishna | Silvio Savarese
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While open-source vision-language models perform well on simple question answering, they still struggle with complex questions that require both perceptual and reasoning capabilities. We propose LATTE, a family of vision-language models that have LeArned to Think wiTh vision spEcialists. By offloading perception to state-of-the-art vision models, our approach enables vision-language models to focus solely on reasoning over high-quality perceptual information. To train LATTE, we synthesize and filter a large dataset of 293K multi-modal reasoning traces over the perceptual outputs of vision specialists. LATTE trained on this data achieves significant gains of 4-5% over baselines across six benchmarks covering both perception and reasoning abilities. Ablation studies reveal that the effectiveness of multi-modal reasoning traces depends on the data sources, formats, and quality of the thoughts.