Sara Ghaboura


2025

pdf bib
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura | Ahmed Heakl | Omkar Thawakar | Ali Husain Salem Abdulla Alharthi | Ines Riahi | Abduljalil Radman | Jorma Laaksonen | Fahad Shahbaz Khan | Salman Khan | Rao Muhammad Anwer
Findings of the Association for Computational Linguistics: NAACL 2025

Recent years have witnessed a significant interest in developing large multi-modal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains including, multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source, including GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the bestopen-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark will be publicly released.

pdf bib
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Sara Ghaboura | Ketan Pravin More | Ritesh Thawkar | Wafa Al Ghallabi | Omkar Thawakar | Fahad Shahbaz Khan | Hisham Cholakkal | Salman Khan | Rao Muhammad Anwer
Findings of the Association for Computational Linguistics: ACL 2025

Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models’ capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. We release the TimeTravel dataset and evaluation suite as open-source resources for culturally and historically informed research.