CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

Xingcheng Zhou; Hao Guo; Rui Song; Walter Zimmer; Mingyu Liu; André Schamschurko; Hu Cao; Alois Knoll

Abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video-question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video consistency from question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, which uses the semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
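The gap between per-instance QA metrics and quadruple-level consistency mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that a quadruple counts as consistent only when all four of its (video, question) answers are correct; the `Quadruple` record, the function names, and the toy data are hypothetical and are not taken from the paper, whose exact scoring rules and failure taxonomy are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Quadruple:
    # Hypothetical record: correctness flags for the four (video, question)
    # pairs formed by two paired videos and two exclusive questions.
    correct: Tuple[bool, bool, bool, bool]


def instance_accuracy(quads: List[Quadruple]) -> float:
    """Fraction of individual (video, question) answers that are correct."""
    flags = [flag for quad in quads for flag in quad.correct]
    return sum(flags) / len(flags)


def quadruple_consistency(quads: List[Quadruple]) -> float:
    """Fraction of quadruples with all four answers correct (assumed rule)."""
    return sum(all(quad.correct) for quad in quads) / len(quads)


# Toy data, invented for illustration only.
quads = [
    Quadruple((True, True, True, True)),
    Quadruple((True, True, True, False)),
    Quadruple((True, False, True, True)),
    Quadruple((True, True, False, True)),
]

print(instance_accuracy(quads))      # 13 of 16 answers correct
print(quadruple_consistency(quads))  # 1 of 4 quadruples fully correct
```

On this toy data a single wrong answer in three of the four quadruples leaves per-instance accuracy high while the all-correct rate drops sharply, which is the kind of gap the abstract describes.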

Anthology ID: 2026.findings-acl.1089
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21665–21684
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1089/
Cite (ACL): Xingcheng Zhou, Hao Guo, Rui Song, Walter Zimmer, Mingyu Liu, André Schamschurko, Hu Cao, and Alois Knoll. 2026. CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 21665–21684, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs (Zhou et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1089.pdf
Checklist: 2026.findings-acl.1089.checklist.pdf
