Hao Guo
2026
CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs
Xingcheng Zhou | Hao Guo | Rui Song | Walter Zimmer | Mingyu Liu | André Schamschurko | Hu Cao | Alois Knoll
Findings of the Association for Computational Linguistics: ACL 2026
Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, which leverages the semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.
2024
Event-Radar: Event-driven Multi-View Learning for Multimodal Fake News Detection
Zihan Ma | Minnan Luo | Hao Guo | Zhi Zeng | Yiran Hao | Xiang Zhao
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The swift detection of multimedia fake news has emerged as a crucial task in combating malicious propaganda and safeguarding the security of the online environment. While existing methods have achieved commendable results in modeling entity-level inconsistency, addressing event-level inconsistency following the inherent subject-predicate logic of news and robustly learning news representations from poor-quality news samples remain two challenges. In this paper, we propose an Event-driven fake news detection framework (Event-Radar) based on multi-view learning, which integrates visual manipulation, textual emotion and multimodal inconsistency at event-level for fake news detection. Specifically, leveraging the capability of graph structures to capture interactions between events and parameters, Event-Radar captures event-level multimodal inconsistency by constructing an event graph that includes multimodal entity subject-predicate logic. Additionally, to mitigate the interference of poor-quality news, Event-Radar introduces a multi-view fusion mechanism, learning comprehensive and robust representations by computing the credibility of each view as a clue, thereby detecting fake news. Extensive experiments demonstrate that Event-Radar achieves outstanding performance on three large-scale fake news detection benchmarks. Our studies also confirm that Event-Radar exhibits strong robustness, providing a paradigm for detecting fake news from noisy news samples.
2023
NORMSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly
Yi Fung | Tuhin Chakrabarty | Hao Guo | Owen Rambow | Smaranda Muresan | Heng Ji
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Knowledge of norms is needed to understand and reason about acceptable behavior in human communication and interactions across sociocultural scenarios. Most computational research on norms has focused on a single culture, and manually built datasets, from non-conversational settings. We address these limitations by proposing a new framework, NormSage, to automatically extract culture-specific norms from multi-lingual conversations. NormSage uses GPT-3 prompting to 1) extract candidate norms directly from conversations and 2) provide explainable self-verification to ensure correctness and relevance. Comprehensive empirical results show the promise of our approach to extract high-quality culture-aware norms from multi-lingual conversations (English and Chinese), across several quality metrics. Further, our relevance verification can be extended to assess the adherence and violation of any norm with respect to a conversation on-the-fly, along with textual explanation. NormSage achieves an AUC of 94.6% in this grounding setup, with generated explanations matching human-written quality.