Noor Ahsan


2025

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar | Dinura Dissanayake | Ketan Pravin More | Ritesh Thawkar | Ahmed Heakl | Noor Ahsan | Yuhao Li | Ilmuz Zaman Mohammed Zumri | Jean Lahoud | Rao Muhammad Anwer | Hisham Cholakkal | Ivan Laptev | Mubarak Shah | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025

Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.