Fangfang Yuan
2023
Mulan: A Multi-Level Alignment Model for Video Question Answering
Yu Fu | Cong Cao | Yuling Yang | Yuhai Lu | Fangfang Yuan | Dakui Wang | Yanbing Liu
Findings of the Association for Computational Linguistics: EMNLP 2023
Video Question Answering (VideoQA) aims to answer questions about the visual content of a video. Current methods mainly focus on improving joint representations of video and text. However, these methods pay little attention to the fine-grained semantic interaction between video and text. In this paper, we propose Mulan: a Multi-Level Alignment Model for Video Question Answering, which establishes alignment between visual and textual modalities at the object-level, frame-level, and video-level. Specifically, for object-level alignment, we propose a mask-guided visual feature encoding method and a visual-guided text description method to learn fine-grained spatial information. For frame-level alignment, we introduce the use of visual features from individual frames, combined with a caption generator, to learn overall spatial information within the scene. For video-level alignment, we propose an expandable ordinal prompt for textual descriptions, combined with visual features, to learn temporal information. Experimental results show that our method outperforms the state-of-the-art methods, even when utilizing the smallest amount of extra visual-language pre-training data and a reduced number of trainable parameters.