Abstract
Text-video retrieval focuses on two aspects: cross-modality interaction and video-language encoding. The mainstream approach is to train a joint embedding space for multimodal interaction, but the structural and semantic differences between text and video make fine-grained understanding difficult in such a space. To address this, we propose an end-to-end graph-based hierarchical aggregation network for text-video retrieval that exploits the hierarchy inherent in text and video. We design a token-level weighted network to refine intra-modality representations and construct a graph-based message-passing attention network for global-local alignment across modalities. We conduct experiments on the public datasets MSR-VTT-9K, MSR-VTT-7K, and MSVD, achieving Recall@1 of 73.0%, 65.6%, and 64.0%, which is 25.7%, 16.5%, and 14.2% better than the current state-of-the-art model.
- Anthology ID: 2022.emnlp-main.374
- Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- Month: December
- Year: 2022
- Address: Abu Dhabi, United Arab Emirates
- Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 5547–5557
- URL: https://aclanthology.org/2022.emnlp-main.374
- DOI: 10.18653/v1/2022.emnlp-main.374
- Cite (ACL): Yahan Yu, Bojie Hu, and Yu Li. 2022. GHAN: Graph-Based Hierarchical Aggregation Network for Text-Video Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5547–5557, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal): GHAN: Graph-Based Hierarchical Aggregation Network for Text-Video Retrieval (Yu et al., EMNLP 2022)
- PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/2022.emnlp-main.374.pdf
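The graph-based message-passing attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the fully connected toy graph, feature dimensions, and single-head scaled dot-product aggregation below are illustrative assumptions, showing only the general idea of nodes (e.g. local token features plus a global node) exchanging attention-weighted messages along graph edges.

```python
import numpy as np

def message_passing_attention(node_feats, adj):
    """One round of attention-based message passing over a graph (toy sketch).

    node_feats: (n, d) node features; adj: (n, n) binary adjacency matrix
    with self-loops. Each node aggregates its neighbours weighted by a
    row-wise softmax of scaled dot-product scores, plus a residual.
    """
    d = node_feats.shape[1]
    scores = node_feats @ node_feats.T / np.sqrt(d)   # (n, n) pairwise scores
    scores = np.where(adj > 0, scores, -np.inf)       # mask non-edges
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
    return node_feats + weights @ node_feats          # residual update

rng = np.random.default_rng(0)
# Hypothetical graph: 4 local token nodes + 1 global node, 8-dim features
text_nodes = rng.standard_normal((5, 8))
adj = np.ones((5, 5))  # fully connected toy graph with self-loops
updated = message_passing_attention(text_nodes, adj)
print(updated.shape)
```

In the paper's setting one would build such graphs per modality and align the updated global and local representations across text and video; this sketch covers only the message-passing step.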