Ahmed Heakl


2025

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Sara Ghaboura | Ahmed Heakl | Omkar Thawakar | Ali Husain Salem Abdulla Alharthi | Ines Riahi | Abduljalil Radman | Jorma Laaksonen | Fahad Shahbaz Khan | Salman Khan | Rao Muhammad Anwer
Findings of the Association for Computational Linguistics: NAACL 2025

Recent years have witnessed a significant interest in developing large multi-modal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, whose quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark will be publicly released.

KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Ahmed Heakl | Muhammad Abdullah Sohail | Mukul Ranjan | Rania Elbadry | Ghazi Shazan Ahmad | Mohamed El-Geish | Omar Maher | Zhiqiang Shen | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025

With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 subdomains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (such as EasyOCR, PaddleOCR, and Surya) by an average of 60% in the character error rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges of accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
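As a rough illustration of the metric reported above (and not KITAB-Bench's official scorer), Character Error Rate is commonly computed as the character-level edit distance between a model's output and the reference transcription, normalized by the reference length. The sketch below assumes this standard definition; the sample strings are hypothetical.

```python
# Minimal sketch of Character Error Rate (CER): Levenshtein edit distance
# between the predicted and reference strings, normalized by reference length.
# Illustrative only; not the benchmark's official evaluation code.

def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """CER = edit_distance / number of reference characters."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)

# Hypothetical example: comparing an OCR output against its ground truth.
print(cer("document understanding", "documant understading"))
```

A lower CER indicates better recognition; the "60% average improvement" in the abstract refers to reductions in this quantity.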

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar | Dinura Dissanayake | Ketan Pravin More | Ritesh Thawkar | Ahmed Heakl | Noor Ahsan | Yuhao Li | Ilmuz Zaman Mohammed Zumri | Jean Lahoud | Rao Muhammad Anwer | Hisham Cholakkal | Ivan Laptev | Mubarak Shah | Fahad Shahbaz Khan | Salman Khan
Findings of the Association for Computational Linguistics: ACL 2025

Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.