Guorong Li

2026

The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
Mingkai Tian | Guorong Li | Yuankai Qi | Anton Van Den Hengel | Qingming Huang
Findings of the Association for Computational Linguistics: EACL 2026

Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract video-informed text prompts to guide language models in generating captions. However, by using representations at a single granularity (e.g., noun phrases or full sentences), these methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics, to promote prompt diversity while ensuring visual relevance. Extensive experiments on both in-domain and cross-domain settings demonstrate that the proposed method consistently outperforms state-of-the-art approaches.

Co-authors

Venues

Findings1

Fix author