From What Is Said to Why It Is Framed: Intent-Aware News Video Understanding

Xiangzheng Kong, Minnan Luo, Wenya Wang, Jiaying Wu, Zhi Zeng, Guang Dai


Abstract
Short-form news videos increasingly shape public perception through strategic framing, yet existing verification methods largely overlook the communicative intent underlying such content. By emphasizing surface semantics, current models struggle to separate stylistic presentation from factual evidence, which leads to shortcut learning and brittle generalization. To address this limitation, we propose the Origin–Objective–Means (OOM) framework, a theory-grounded representation of communicative intent that captures creator stance, audience need activation, and communication strategy. We validate OOM through large-scale human annotation, revealing distinct and consistent lexical and structural patterns across intent dimensions. Building on this representation, we operationalize intent as an explicit semantic condition rather than a prediction target. Concretely, we introduce Intent-Guided Prompting (IGP) to condition LLM reasoning, and an Intent-Conditioned Multimodal Detection (ICMD) framework, which injects intent into multimodal detectors via feature-wise modulation. Experiments on FakeSV and FakeTT show that modeling intent as an intermediate condition consistently improves accuracy and robustness across diverse vision–language backbones, while substantially reducing reliance on spurious stylistic correlations.
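The abstract does not spell out the exact form of the feature-wise modulation used in ICMD. As a rough illustration only, the sketch below assumes a FiLM-style scale-and-shift, where an intent embedding predicts a per-channel gain (gamma) and offset (beta) that are applied to the detector's features. All names (`linear`, `film`, `w_gamma`, `b_gamma`, `w_beta`, `b_beta`) and all toy values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Sequence[float]) -> Vector:
    """Affine map weights @ x + bias, written with plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(
    features: Sequence[float],
    intent: Sequence[float],
    w_gamma: Matrix,
    b_gamma: Vector,
    w_beta: Matrix,
    b_beta: Vector,
) -> Vector:
    """Scale and shift each feature channel using parameters predicted
    from the intent embedding (assumed FiLM-style conditioning)."""
    gamma = linear(w_gamma, b_gamma, intent)  # per-channel gain
    beta = linear(w_beta, b_beta, intent)     # per-channel offset
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Hypothetical toy inputs: 3 feature channels, 2-dimensional intent embedding.
features = [1.0, 2.0, 3.0]
intent = [1.0, 0.0]
w_gamma = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [1.0, 1.0, 1.0]  # gain starts at 1, so a zero intent leaves features unchanged
w_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
b_beta = [0.0, 0.0, 0.0]

modulated = film(features, intent, w_gamma, b_gamma, w_beta, b_beta)
unchanged = film(features, [0.0, 0.0], w_gamma, b_gamma, w_beta, b_beta)
```

In this sketch the intent vector changes the features only through the predicted gain and offset, which is one way to treat intent as a condition on the detector rather than as a prediction target.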
Anthology ID:
2026.findings-acl.1945
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
39039–39050
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1945/
Cite (ACL):
Xiangzheng Kong, Minnan Luo, Wenya Wang, Jiaying Wu, Zhi Zeng, and Guang Dai. 2026. From What Is Said to Why It Is Framed: Intent-Aware News Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 39039–39050, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
From What Is Said to Why It Is Framed: Intent-Aware News Video Understanding (Kong et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1945.pdf
Checklist:
 2026.findings-acl.1945.checklist.pdf