Abstract
In this paper, we propose a new approach to learning multimodal multilingual embeddings for matching images with their relevant captions in two languages. We combine two existing objective functions so that images and captions lie close together in a joint embedding space, while adapting the alignment of word embeddings across the two languages in our model. We show that our approach generalizes better, achieving state-of-the-art performance on the text-to-image and image-to-text retrieval tasks and on the caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k with German and English captions, and Microsoft-COCO with English and Japanese captions.
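As an illustration of how such a combined objective could look, the sketch below pairs a hinge-based triplet ranking loss for image-caption matching with a cross-lingual alignment term over translated word pairs. This is a minimal PyTorch-style sketch under those assumptions; the function names, hyperparameters, and exact loss forms are illustrative, not the authors' implementation (their code is linked under "Code" below).

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, cap_emb, margin=0.2):
    """Hinge-based triplet ranking loss over a batch: pull matching
    image-caption pairs together, push non-matching pairs apart.
    Inputs are assumed to be L2-normalized embeddings."""
    scores = img_emb @ cap_emb.t()                          # pairwise cosine similarities
    diag = scores.diag().unsqueeze(1)                       # similarities of the true pairs
    cost_cap = (margin + scores - diag).clamp(min=0)        # image -> negative captions
    cost_img = (margin + scores - diag.t()).clamp(min=0)    # caption -> negative images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_cap.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

def alignment_loss(src_word_emb, tgt_word_emb):
    """Encourage embeddings of translated word pairs (rows aligned
    one-to-one across the two languages) to coincide."""
    return F.mse_loss(src_word_emb, tgt_word_emb)

def total_loss(img_emb, en_cap_emb, de_cap_emb, en_word_emb, de_word_emb, alpha=1.0):
    # Combine cross-modal ranking terms for both caption languages with
    # a cross-lingual word-embedding alignment term (weight alpha is a guess).
    return (ranking_loss(img_emb, en_cap_emb)
            + ranking_loss(img_emb, de_cap_emb)
            + alpha * alignment_loss(en_word_emb, de_word_emb))
```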
- Anthology ID: D19-6605
- Volume: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)
- Month: November
- Year: 2019
- Address: Hong Kong, China
- Editors: James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, Arpit Mittal
- Venue: WS
- Publisher: Association for Computational Linguistics
- Pages: 27–33
- URL: https://aclanthology.org/D19-6605
- DOI: 10.18653/v1/D19-6605
- Cite (ACL): Alireza Mohammadshahi, Rémi Lebret, and Karl Aberer. 2019. Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 27–33, Hong Kong, China. Association for Computational Linguistics.
- Cite (Informal): Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task (Mohammadshahi et al., 2019)
- PDF: https://preview.aclanthology.org/improve-issue-templates/D19-6605.pdf
- Code: alirezamshi/AME-CMR
- Data: MS COCO