@inproceedings{huang-etal-2019-multi,
  title     = {Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations},
  author    = {Huang, Po-Yao and
               Chang, Xiaojun and
               Hauptmann, Alexander},
  editor    = {Inui, Kentaro and
               Jiang, Jing and
               Ng, Vincent and
               Wan, Xiaojun},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ({EMNLP}-{IJCNLP})},
  month     = nov,
  year      = {2019},
  address   = {Hong Kong, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/D19-1154/},
  doi       = {10.18653/v1/D19-1154},
  pages     = {1461--1467},
  abstract  = {With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.},
}
Markdown (Informal)
[Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations](https://aclanthology.org/D19-1154/) (Huang et al., EMNLP-IJCNLP 2019)
ACL