Ritesh Thawkar
2025
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Sara Ghaboura
|
Ketan Pravin More
|
Ritesh Thawkar
|
Wafa Al Ghallabi
|
Omkar Thawakar
|
Fahad Shahbaz Khan
|
Hisham Cholakkal
|
Salman Khan
|
Rao Muhammad Anwer
Findings of the Association for Computational Linguistics: ACL 2025
Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models’ capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. We release the TimeTravel dataset and evaluation suite as open-source resources for culturally and historically informed research.
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar
|
Dinura Dissanayake
|
Ketan Pravin More
|
Ritesh Thawkar
|
Ahmed Heakl
|
Noor Ahsan
|
Yuhao Li
|
Ilmuz Zaman Mohammed Zumri
|
Jean Lahoud
|
Rao Muhammad Anwer
|
Hisham Cholakkal
|
Ivan Laptev
|
Mubarak Shah
|
Fahad Shahbaz Khan
|
Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025
Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code is available at https://github.com/mbzuai-oryx/LlamaV-o1.