RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li
Abstract
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most of the state-of-the-art TVR methods learn image-to-video transfer learning based on the large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation cost. To this end, we propose to conduct efficient text-video Retrieval with a salient-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics including temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from frozen CLIP backbone, which accentuates silent frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism which firstly selects top responsive visual patch and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that our RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient finetuning methods.- Anthology ID:
- 2024.findings-acl.427
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2024
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Lun-Wei Ku, Andre Martins, Vivek Srikumar
- Venue:
- Findings
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 7160–7174
- Language:
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.427/
- DOI:
- 10.18653/v1/2024.findings-acl.427
- Cite (ACL):
- Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, and Ge Li. 2024. RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7160–7174, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter (Cao et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2024.findings-acl.427.pdf