DREAM: Improving Video-Text Retrieval Through Relevance-Based Augmentation Using Large Foundation Models
Yimu Wang | Shuai Yuan | Bo Xue | Xiangru Jian | Wei Pang | Mushi Wang | Ning Yu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent progress in video-text retrieval has been driven largely by advancements in model architectures and training strategies. However, the representation learning capabilities of video-text retrieval models remain constrained by low-quality and limited training data annotations. To address this issue, we present a novel Video-Text Retrieval Paradigm with Relevance-based Augmentation, namely DREAM, which enhances video and text data using large foundation models to learn more generalized features. Specifically, we first adopt a simple augmentation method, which generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by recent advancements in visual and language generative models, we propose a more robust augmentation method through textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). To further enrich video and text information, we propose a relevance-based augmentation method, in which LLMs and VGMs generate and integrate new relevant information into the original data. Extensive experiments on several video-text retrieval benchmarks, leveraging this enriched data, demonstrate the superiority of DREAM over existing methods. Code will be available upon acceptance.
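The first augmentation step described above, generating self-similar data by randomly duplicating or dropping subwords and frames, can be made concrete with the minimal sketch below. The function names, probabilities, and token/frame-index representations are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the "self-similar" augmentation: randomly duplicate or drop
# subwords in a caption and frame indices in a video clip.
# Probabilities and function names are hypothetical choices for illustration.
import random
from typing import List, Sequence


def augment_tokens(tokens: Sequence[str], p_dup: float = 0.1, p_drop: float = 0.1) -> List[str]:
    """Return a self-similar copy of a token sequence via random duplication/dropping."""
    out: List[str] = []
    for tok in tokens:
        r = random.random()
        if r < p_drop:
            continue                 # drop this subword
        out.append(tok)
        if r > 1.0 - p_dup:
            out.append(tok)          # duplicate this subword
    return out if out else list(tokens)  # guard against dropping everything


def augment_frame_indices(num_frames: int, p_dup: float = 0.1, p_drop: float = 0.1) -> List[int]:
    """Same idea for video: duplicate or drop frame indices while keeping temporal order."""
    out: List[int] = []
    for i in range(num_frames):
        r = random.random()
        if r < p_drop:
            continue                 # drop this frame
        out.append(i)
        if r > 1.0 - p_dup:
            out.append(i)            # duplicate this frame
    return out if out else list(range(num_frames))


# Example: augment a tokenized caption and pick frames for a 12-frame clip.
print(augment_tokens(["a", "dog", "runs", "on", "the", "beach"]))
print(augment_frame_indices(12))
```

The augmented caption and frame selection would then be paired with the original sample as additional positives during contrastive training; the LLM/VGM-based paraphrasing, stylization, and relevance-based enrichment steps operate on the same inputs but replace the random edits with generated content.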