Can Video LLMs See Through Illusions? Video-Illusion QA Benchmark Dataset

Souto Ohira, Tosho Hirasawa, Mamoru Komachi


Abstract
Recent advances in multimodal learning have sparked growing interest in how large vision-language models interpret optical illusions. While the behavior of image LLMs, which handle a single image and text but not video input, on visual illusion images has been actively explored, research on their video counterparts remains limited. Video LLMs, which process sequential frames, are gaining prominence in areas such as robotics and autonomous driving. Understanding how they handle visual illusions over time is crucial for safety and may also reveal their potential as computational models of human cognition. To address this gap, we present the Video-Illusion QA Benchmark (VILQA), a novel video question answering (QA) benchmark composed mainly of carefully curated illusion videos that exhibit temporally driven perceptual phenomena. To the best of our knowledge, VILQA is the largest and most comprehensive benchmark for temporally driven visual illusions. We evaluate several video LLMs on this benchmark from multiple perspectives. Some models perceived visual illusions in a way similar to typical human experience and also demonstrated an ability to resist illusions even more effectively than humans. The constructed dataset is available at https://github.com/SDS-NLP/VILQA.
Anthology ID:
2026.lrec-main.730
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
9291–9300
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.730/
Cite (ACL):
Souto Ohira, Tosho Hirasawa, and Mamoru Komachi. 2026. Can Video LLMs See Through Illusions? Video-Illusion QA Benchmark Dataset. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 9291–9300, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
Can Video LLMs See Through Illusions? Video-Illusion QA Benchmark Dataset (Ohira et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.730.pdf