Rao Anwer

2026

As Large Multimodal Models (LMMs) become more capable, there is growing interest in evaluating their reasoning processes alongside their final outputs. However, most existing benchmarks remain focused on English, overlooking languages with rich linguistic and cultural depth such as Arabic. To address this gap, we introduce the Comprehensive Arabic Multimodal Reasoning Benchmark (ARB), the first benchmark designed to evaluate step-by-step reasoning in Arabic across both textual and visual modalities. ARB covers 11 diverse domains and over 40 subfields, including visual reasoning, optical character recognition, scientific analysis, and cultural interpretation. It comprises 2,219 multimodal samples paired with over 8K human-curated reasoning steps and corresponding actions, verified through a human-in-the-loop process. We evaluated 15 state-of-the-art open- and closed-source LMMs and found persistent challenges in coherence, faithfulness, and cultural grounding. ARB provides a structured framework for diagnosing multimodal reasoning in underrepresented languages, marking a critical step toward inclusive, transparent, and culturally aware AI systems. The benchmark, rubric, and evaluation suite are publicly available.

AgriChain: Visually-Grounded Expert-Verified Reasoning for Interpretable Agricultural Vision–Language Models
Hazza Mahmood | Yongqiang Yu | Rao Anwer
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Accurate and interpretable plant disease diagnosis remains a key challenge for vision–language models in real agricultural settings. We present AgriChain, a new dataset of around 11,000 expert-curated leaf images covering a wide range of crops and diseases. Each image is paired with a disease label, a calibrated confidence score, and an expert-verified chain-of-thought explanation. Draft rationales were first generated by GPT-4o and then refined by a professional agricultural engineer using standard descriptors such as lesion color, margin, and distribution. Using these data, we fine-tune the open vision–language model Qwen-2.5-VL-3B to jointly identify diseases and explain its reasoning in a way that mirrors expert thinking. On a 1,000-image test set, our model reaches 73.1% accuracy and produces explanations that align closely with human expertise. These results show that expert-verified reasoning supervision enhances both performance and interpretability, bringing us closer to transparent and trustworthy AI tools for sustainable agriculture. To support reproducibility and further research, the dataset and code are publicly available at https://github.com/hazzanabeel12-netizen/agrichain.

2025

Capitalizing on vast amounts of image-text data, large-scale vision-language pre-training has demonstrated remarkable zero-shot capabilities and has been used in several applications. However, models trained on general, everyday web-crawled data often perform sub-optimally in specialized domains, likely due to domain shift. Recent works have tackled this problem for some domains (e.g., healthcare) by constructing domain-specialized image-text data. However, constructing a dedicated large-scale image-text dataset for the sustainable areas of agriculture and livestock remains an open research problem. Further, this domain requires fine-grained feature learning due to the subtle nature of the downstream tasks (e.g., nutrient deficiency detection and livestock breed classification). To address this, we present AgriCLIP, a vision-language foundational model dedicated to the domain of agriculture and livestock. First, we propose a large-scale dataset named ALive that leverages a customized prompt generation strategy to overcome the scarcity of expert annotations. Our ALive dataset covers crops, livestock, and fishery, with around 600,000 image-text pairs. Second, we propose a training pipeline that integrates contrastive and self-supervised learning to learn both global semantic and local fine-grained domain-specialized features. Experiments on a diverse set of 20 downstream tasks demonstrate the effectiveness of the AgriCLIP framework, which achieves an absolute gain of 9.07% in average zero-shot classification accuracy over the standard CLIP adaptation using the domain-specialized ALive dataset. Our ALive dataset and code are available on GitHub.

Venues

LREC: 2
COLING: 1