VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin
Abstract
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, given that videos often contain abundant, detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. Firstly, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset with VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding the long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), for further understanding improvement. Finally, we construct a Long Video Description Ranking (LVDR) benchmark for evaluating the long-description capability more comprehensively. Extensive experimental results on widely used text-video retrieval benchmarks with both short and long descriptions, and on our LVDR benchmark, demonstrate the effectiveness of our method.
- Anthology ID:
- 2024.emnlp-main.898
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 16061–16075
- URL:
- https://preview.aclanthology.org/ingest_wac_2008/2024.emnlp-main.898/
- DOI:
- 10.18653/v1/2024.emnlp-main.898
- Cite (ACL):
- Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, and Lianwen Jin. 2024. VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16061–16075, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models (Wang et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/ingest_wac_2008/2024.emnlp-main.898.pdf