Christoph Feichtenhofer
2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Prahal Arora | Masoumeh Aminzadeh | Christoph Feichtenhofer | Florian Metze | Luke Zettlemoyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Dmytro Okhonko | Armen Aghajanyan | Florian Metze | Luke Zettlemoyer | Christoph Feichtenhofer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest-neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.
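As a rough illustration of the contrastive objective the abstract describes, the sketch below implements a symmetric InfoNCE loss over a batch of paired video and text embeddings in PyTorch (the ecosystem the released code builds on). The function name, the temperature value, and the assumption that positives sit on the diagonal of the batch similarity matrix are illustrative; the paper's temporally overlapping clip sampling and retrieval-based hard-negative mining are not reproduced here.

```python
import torch
import torch.nn.functional as F

def video_text_info_nce(video_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (video, text) embedding pairs.

    Assumes row i of `video_emb` and row i of `text_emb` form a positive
    pair; every other row in the batch serves as a negative.
    """
    v = F.normalize(video_emb, dim=-1)   # unit-norm so dot product = cosine
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature       # [batch, batch] similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Average the video-to-text and text-to-video cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Toy usage: 8 pairs of 512-dim embeddings from the two encoders.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
loss = video_text_info_nce(video_emb, text_emb)
```

In this sketch all in-batch non-matching pairs act as negatives; VideoCLIP additionally retrieves harder negatives via nearest-neighbor search, which would replace the purely random batch construction assumed above.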