Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Fangda Ye; Kuicai Dong; Xie Zhifei; Yuxin Hu; Yihang Yin; Shurui Huang; Shikai Dong; Chen Zhang; Jianzhu Bao; Shuicheng Yan

Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M²LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. It enables unified multimodal assessment, fair comparison, and accessible evaluation without commercial APIs. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap. Our code is available at https://github.com/fangda-ye/Deep-Report.

Anthology ID: 2026.acl-long.1909
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 41137–41177
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1909/
Cite (ACL): Fangda Ye, Kuicai Dong, Xie Zhifei, Yuxin Hu, Yihang Yin, Shurui Huang, Shikai Dong, Chen Zhang, Jianzhu Bao, and Shuicheng Yan. 2026. Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 41137–41177, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation (Ye et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1909.pdf
Checklist: 2026.acl-long.1909.checklist.pdf
