WikiVideo: Article Generation from Multiple Videos

Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme


Abstract
We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events—from natural disasters to political elections—where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce , a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles’ claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
Anthology ID:
2026.findings-acl.3
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
45–70
Language:
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.3/
DOI:
Bibkey:
Cite (ACL):
Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, and Benjamin Van Durme. 2026. WikiVideo: Article Generation from Multiple Videos. In Findings of the Association for Computational Linguistics: ACL 2026, pages 45–70, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
WikiVideo: Article Generation from Multiple Videos (Martin et al., Findings 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.3.pdf
Checklist:
 2026.findings-acl.3.checklist.pdf