The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
Mingkai Tian, Guorong Li, Yuankai Qi, Anton Van Den Hengel, Qingming Huang
Abstract
Zero-shot video captioning requires that a model generate high-quality captions without human-annotated video-text pairs for training. State-of-the-art approaches to the problem leverage CLIP to extract video-informed text prompts to guide language models in generating captions. However, by using representations at a single granularity (e.g., noun phrases or full sentences), these methods tend to focus on one key aspect of the scene and build a caption that ignores the rest of the visual input. To address this issue, and generate more accurate and complete captions, we propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences. Moreover, we introduce a category-aware retrieval mechanism that models the distribution of natural language surrounding the specific topics, to promote prompt diversity while ensuring visual relevance. Extensive experiments on both in-domain and cross-domain settings demonstrate that the proposed method consistently outperforms state-of-the-art approaches.
- Anthology ID:
- 2026.findings-eacl.98
- Volume:
- Findings of the Association for Computational Linguistics: EACL 2026
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Marquez
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1916–1929
- URL:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.98/
- Cite (ACL):
- Mingkai Tian, Guorong Li, Yuankai Qi, Anton Van Den Hengel, and Qingming Huang. 2026. The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1916–1929, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning (Tian et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.98.pdf