Xiangzheng Kong
2026
From Detection to Understanding: Multi-Turn Reasoning for Video Misinformation Analysis
Zhi Zeng | Jiaying Wu | Minnan Luo | Di Zhang | Yifei Yang | Xiangzheng Kong | Herun Wan | Zihan Ma
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Video misinformation detection is often approached as a binary veracity classification problem, overlooking the complex reasoning required to explain how and why content misleads. Existing benchmarks fail to capture the diversity of manipulation strategies, such as AI-generated edits and out-of-context manipulation, and do not evaluate whether models can provide process-level justifications for their judgments. We address these limitations with MisVideoQA, a multi-turn benchmark designed to assess comprehensive understanding and reasoning in video misinformation analysis. MisVideoQA covers 12 fine-grained deception categories and evaluates models along six dimensions, progressing from perceptual attribution to intent and persuasion analysis. Recognizing that standard MLLMs struggle to sustain such structured, evidence-based deduction, we propose MisAgent, a Delphi-inspired multi-agent framework in which specialized agents collaboratively integrate multimodal cues with external evidence. Experimental results show that state-of-the-art multimodal large language models perform poorly on MisVideoQA, while MisAgent consistently improves reasoning accuracy and explanation quality. Together, our benchmark and framework establish a unified foundation for reliable, interpretable, and evidence-grounded video misinformation analysis.
2025
IMOL: Incomplete-Modality-Tolerant Learning for Multi-Domain Fake News Video Detection
Zhi Zeng | Jiaying Wu | Minnan Luo | Herun Wan | Xiangzheng Kong | Zihan Ma | Guang Dai | Qinghua Zheng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While recent advances in fake news video detection have shown promising potential, existing approaches typically (1) focus on a specific domain (e.g., politics) and (2) assume the availability of multiple modalities, including video, audio, description texts, and related images. However, these methods struggle to generalize to real-world scenarios, where questionable information spans diverse domains and is often modality-incomplete due to factors such as upload degradation or missing metadata. To address these challenges, we introduce two real-world multi-domain news video benchmarks that reflect modality incompleteness and propose IMOL, an incomplete-modality-tolerant learning framework for multi-domain fake news video detection. Inspired by cognitive theories suggesting that humans infer missing modalities through cross-modal guidance and retrieve relevant knowledge from memory for reference, IMOL employs a hierarchical transferable information integration strategy. This consists of two key phases: (1) leveraging cross-modal consistency to reconstruct missing modalities and (2) refining sample-level transferable knowledge through cross-sample associative reasoning. Extensive experiments demonstrate that IMOL significantly enhances the performance and robustness of multi-domain fake news video detection while effectively generalizing to unseen domains under incomplete modality conditions.