ELIOT: Zero-Shot Video-Text Retrieval through Relevance-Boosted Captioning and Structural Information Extraction

Xuye Liu, Yimu Wang, Jian Zhao


Abstract
Recent advances in video-text retrieval (VTR) have largely relied on supervised learning and fine-tuning. In this paper, we introduce , a novel zero-shot VTR framework that leverages off-the-shelf video captioners, large language models (LLMs), and text retrieval methods—entirely without additional training or annotated data. Due to the limited power of captioning methods, the captions often miss important content in the video, resulting in unsatisfactory retrieval performance. To translate more information into video captions, we first generates initial captions for videos, then enhances them using a relevance-boosted captioning strategy powered by LLMs, enriching video descriptions with salient details. To further emphasize key content, we propose structural information extraction, organizing visual elements such as objects, events, and attributes into structured templates, further boosting the retrieval performance. Benefiting from the enriched captions and structuralized information, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of over existing fine-tuned and pretraining methods without any data. They also show that the enriched captions capture key details from the video with minimal noise. Code and data will be released to facilitate future research.
Anthology ID:
2025.naacl-srw.37
Volume:
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Month:
April
Year:
2025
Address:
Albuquerque, USA
Editors:
Abteen Ebrahimi, Samar Haider, Emmy Liu, Sammar Haider, Maria Leonor Pacheco, Shira Wein
Venues:
NAACL | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
381–391
Language:
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-srw.37/
DOI:
Bibkey:
Cite (ACL):
Xuye Liu, Yimu Wang, and Jian Zhao. 2025. ELIOT: Zero-Shot Video-Text Retrieval through Relevance-Boosted Captioning and Structural Information Extraction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop), pages 381–391, Albuquerque, USA. Association for Computational Linguistics.
Cite (Informal):
ELIOT: Zero-Shot Video-Text Retrieval through Relevance-Boosted Captioning and Structural Information Extraction (Liu et al., NAACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.naacl-srw.37.pdf