@inproceedings{zhang-etal-2023-video,
title = "Video-{LL}a{MA}: An Instruction-tuned Audio-Visual Language Model for Video Understanding",
author = "Zhang, Hang and
Li, Xin and
Bing, Lidong",
editor = "Feng, Yansong and
Lefever, Els",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-demo.49/",
doi = "10.18653/v1/2023.emnlp-demo.49",
pages = "543--553",
abstract = "We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual {\&} audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both visual {\&} audio encoders with LLM{'}s embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate amount but higher quality. We found Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos."
}
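
The abstract describes the overall flow: frozen visual and audio encoders (ImageBind on the audio side), a Video Q-Former and an Audio Q-Former that turn encoder features into a fixed number of query embeddings, and a projection of those queries into the frozen LLM's embedding space. The snippet below is a minimal, hypothetical PyTorch sketch of that flow only; every module name, dimension (`d_vis`, `d_aud`, `d_llm`, `n_query`), and the toy "frozen encoder" stand-ins are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of the Video-LLaMA-style pipeline described in the abstract.
# All names, sizes, and the dummy frozen encoders are assumptions for illustration.
import torch
import torch.nn as nn

d_vis, d_aud, d_llm, n_query = 1024, 768, 4096, 32   # assumed dimensions

class QFormer(nn.Module):
    """Learnable queries that cross-attend to frozen encoder features."""
    def __init__(self, d_in, n_query):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_query, d_in))
        self.attn = nn.MultiheadAttention(d_in, num_heads=8, batch_first=True)

    def forward(self, feats):                          # feats: (B, T, d_in)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(q, feats, feats)            # (B, n_query, d_in)
        return out

# Dummy stand-ins for the frozen pre-trained image encoder and ImageBind audio encoder.
frozen_image_encoder = lambda frames: torch.randn(frames.size(0), frames.size(1), d_vis)
frozen_imagebind_audio = lambda audio: torch.randn(audio.size(0), 8, d_aud)

video_qformer = QFormer(d_vis, n_query)                # trainable (Video Q-Former)
audio_qformer = QFormer(d_aud, n_query)                # trainable (Audio Q-Former)
proj_vis = nn.Linear(d_vis, d_llm)                     # project into LLM embedding space
proj_aud = nn.Linear(d_aud, d_llm)

frames = torch.randn(2, 16, 3, 224, 224)               # (B, T, C, H, W) dummy video
audio = torch.randn(2, 16000)                           # dummy waveform

vis_tokens = proj_vis(video_qformer(frozen_image_encoder(frames)))
aud_tokens = proj_aud(audio_qformer(frozen_imagebind_audio(audio)))

# Query embeddings from both modalities act as a soft prefix for the frozen LLM.
multimodal_prefix = torch.cat([vis_tokens, aud_tokens], dim=1)  # (B, 2*n_query, d_llm)
print(multimodal_prefix.shape)
```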