Abstract
Video-and-Language learning, such as video question answering or video captioning, is the next challenge for the deep learning community, as it approaches the way human intelligence perceives everyday life. These tasks require multi-modal reasoning, i.e., handling visual and textual information simultaneously across time. From this point of view, a cross-modality attention module that fuses video and text representations plays a critical role in most recent approaches. However, existing Video-and-Language models merely compute the attention weights without considering the different characteristics of the video and text modalities. Such a naïve attention module prevents current models from fully exploiting the strength of cross-modality. In this paper, we propose a novel Modality Alignment method that benefits the cross-modality attention module by guiding it to easily amalgamate multiple modalities. Specifically, we exploit Centered Kernel Alignment (CKA), which was originally proposed to measure the similarity between two deep representations. Our method directly optimizes CKA to align the video and text embedding representations, thereby helping the cross-modality attention module combine information across different modalities. Experiments on real-world Video QA tasks demonstrate that our method significantly outperforms conventional multi-modal methods, with a +3.57% accuracy gain over the baseline on a popular benchmark dataset. Additionally, in a synthetic data environment, we show that learning the alignment with our method boosts the performance of the cross-modality attention.
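To make the idea concrete, here is a minimal sketch (not the authors' implementation) of a linear-CKA alignment term between paired video and text embeddings in PyTorch; the function and variable names are hypothetical, and the paper may use a different CKA variant or weighting of the auxiliary loss.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Linear Centered Kernel Alignment between two batches of representations.

    x: (n, d1) video embeddings, y: (n, d2) text embeddings, paired along n.
    Returns a scalar in [0, 1]; higher means the representations are more aligned.
    """
    # Center each feature matrix over the batch dimension.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)

    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.norm(y.t() @ x, p="fro") ** 2
    norm_x = torch.norm(x.t() @ x, p="fro")
    norm_y = torch.norm(y.t() @ y, p="fro")
    return cross / (norm_x * norm_y + eps)

def cka_alignment_loss(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Maximizing CKA is the same as minimizing (1 - CKA); this term would be
    # added to the main task loss as an auxiliary alignment objective
    # (the weighting is a hyperparameter, assumed here, not taken from the paper).
    return 1.0 - linear_cka(video_emb, text_emb)
```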
- Anthology ID:
- 2022.lrec-1.295
- Volume:
- Proceedings of the Thirteenth Language Resources and Evaluation Conference
- Month:
- June
- Year:
- 2022
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 2759–2770
- URL:
- https://aclanthology.org/2022.lrec-1.295
- Cite (ACL):
- Hyeongu Yun, Yongil Kim, and Kyomin Jung. 2022. Modality Alignment between Deep Representations for Effective Video-and-Language Learning. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2759–2770, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Modality Alignment between Deep Representations for Effective Video-and-Language Learning (Yun et al., LREC 2022)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2022.lrec-1.295.pdf
- Data
- TVQA, TVQA+