Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer

Pranav Arora, Selen Pehlivan, Jorma Laaksonen


Abstract
The rapid proliferation of multimedia content has necessitated the development of effective multimodal video retrieval systems. Multimodal video retrieval is a non-trivial task involving retrieval of relevant information across different modalities, such as text, audio, and visual content. This work aims to improve multimodal retrieval by guiding the creation of a shared embedding space with task-specific contrastive loss functions. An important aspect of our work is a model that learns retrieval cues for the textual query from multiple modalities, both separately and jointly, within a hierarchical architecture that can be flexibly extended and fine-tuned for any number of modalities. To this end, the loss functions and the architectural design of the model are developed with a strong focus on increasing the mutual information between the textual and cross-modal representations. The proposed approach is quantitatively evaluated on the MSR-VTT and YouCook2 text-to-video retrieval benchmark datasets. The results show that the approach not only holds its own against state-of-the-art methods but also outperforms them in a number of scenarios, with notable relative improvements over the baseline in the R@1, R@5, and R@10 metrics.
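For readers unfamiliar with the R@1, R@5, and R@10 metrics named in the abstract, the following is a minimal sketch of how Recall@K is commonly computed for text-to-video retrieval. It is not the authors' evaluation code. The function name `recall_at_k`, the toy similarity matrix, and the assumption that query i matches video i are all illustrative.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose matching video ranks in the top k.

    similarity[i][j] is the score between text query i and video j.
    Assumption for this sketch: the ground-truth match for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix with made-up numbers (3 queries x 3 videos).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video is ranked first
    [0.8, 0.5, 0.2],  # query 1: correct video is ranked second
    [0.7, 0.6, 0.1],  # query 2: correct video is ranked third
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
r3 = recall_at_k(sim, 3)
```

On this toy matrix one of three queries is a hit at K=1, two at K=2, and all three at K=3, so Recall@K can only stay the same or grow as K increases.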
Anthology ID:
2024.lrec-main.1374
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
15823–15834
URL:
https://aclanthology.org/2024.lrec-main.1374
Cite (ACL):
Pranav Arora, Selen Pehlivan, and Jorma Laaksonen. 2024. Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15823–15834, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Text-to-Multimodal Retrieval with Bimodal Input Fusion in Shared Cross-Modal Transformer (Arora et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/landing_page/2024.lrec-main.1374.pdf