CAVE : Detecting and Explaining Commonsense Anomalies in Visual Environments

Rishika Bhagwatkar; Syrielle Montariol; Angelika Romanou; Beatriz Borges; Irina Rish; Antoine Bosselut

Abstract

Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification, with fine-grained annotations for visual grounding and for categorizing anomalies based on their visual manifestations, complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.

Anthology ID: 2025.emnlp-main.1379
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 27098–27139
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1379/
Cite (ACL): Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, and Antoine Bosselut. 2025. CAVE : Detecting and Explaining Commonsense Anomalies in Visual Environments. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27098–27139, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CAVE : Detecting and Explaining Commonsense Anomalies in Visual Environments (Bhagwatkar et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1379.pdf
Checklist: 2025.emnlp-main.1379.checklist.pdf
