David Herel

2026

We introduce Comic Visual Question Answering (ComicVQA), a comics-based benchmark for evaluating multimodal large language models (MLLMs) on visual reasoning. ComicVQA comprises two tasks: (i) Missing Panel Prediction, which tests fine-grained visual grounding, and (ii) Panel Sorting, which evaluates sequential narrative understanding. Proprietary models achieve up to 62.6% on Missing Panel Prediction and 46.4% on Panel Sorting, whereas open-source models reach only 47.7% and 26.9%, respectively. In contrast, human annotators achieve over 83% accuracy on both tasks, revealing a large gap between current models and human-level multimodal understanding of comics. Through controlled ordering ablations and a detailed error taxonomy, we show that current MLLMs rely primarily on coarse temporal cues and struggle with fine-grained visual reasoning. These findings establish ComicVQA as a diagnostic benchmark for advancing multimodal visual reasoning in comics.

Co-authors

Venues

Findings (1)
