The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning

Mingkai Tian; Guorong Li; Yuankai Qi; Anton Van Den Hengel; Qingming Huang

Abstract

Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract video-informed text prompts to guide language models in generating captions. However, by using representations at a single granularity (e.g., noun phrases or full sentences), these methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics, to promote prompt diversity while ensuring visual relevance. Extensive experiments on both in-domain and cross-domain settings demonstrate that the proposed method consistently outperforms state-of-the-art approaches.

Anthology ID: 2026.findings-eacl.98
Volume: Findings of the Association for Computational Linguistics: EACL 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1916–1929
URL: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.98/
Cite (ACL): Mingkai Tian, Guorong Li, Yuankai Qi, Anton Van Den Hengel, and Qingming Huang. 2026. The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1916–1929, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning (Tian et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.98.pdf
Checklist: 2026.findings-eacl.98.checklist.pdf
