@inproceedings{pan-etal-2025-propy,
    title     = {{ProPy}: Building Interactive Prompt Pyramids upon {CLIP} for Partially Relevant Video Retrieval},
    author    = {Pan, Yi and
                 Zhang, Yujia and
                 Kampffmeyer, Michael and
                 Zhao, Xiaoguang},
    editor    = {Christodoulopoulos, Christos and
                 Chakraborty, Tanmoy and
                 Rose, Carolyn and
                 Peng, Violet},
    booktitle = {Findings of the Association for Computational Linguistics: {EMNLP} 2025},
    month     = nov,
    year      = {2025},
    address   = {Suzhou, China},
    publisher = {Association for Computational Linguistics},
    url       = {https://aclanthology.org/2025.findings-emnlp.28/},
    doi       = {10.18653/v1/2025.findings-emnlp.28},
    pages     = {519--533},
    isbn      = {979-8-89176-335-7},
    abstract  = {Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid, a hierarchical structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. We will release all code and checkpoints to facilitate further research.},
}

@comment{ The text below is a leftover citation blurb pasted from the ACL
  Anthology page; it is not part of the entry above. BibTeX ignores text
  outside entries, but it is kept here (separated from the entry) for
  reference. Note it still carries the original preview-mirror URL. }

Markdown (Informal)
[ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval](https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.28/) (Pan et al., Findings 2025)
ACL