Supervised Visual Attention for Multimodal Neural Machine Translation
Tetsuro Nishihara, Akihiro Tamura, Takashi Ninomiya, Yutaro Omote, Hideki Nakayama
Abstract
This paper proposes a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed visual attention mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset show that a Transformer-based MNMT model can be improved by incorporating our proposed supervised visual attention mechanism, and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU, +1.7 METEOR).
- Anthology ID: 2020.coling-main.380
- Volume: Proceedings of the 28th International Conference on Computational Linguistics
- Month: December
- Year: 2020
- Address: Barcelona, Spain (Online)
- Editors: Donia Scott, Nuria Bel, Chengqing Zong
- Venue: COLING
- Publisher: International Committee on Computational Linguistics
- Pages: 4304–4314
- URL: https://aclanthology.org/2020.coling-main.380
- DOI: 10.18653/v1/2020.coling-main.380
- Cite (ACL): Tetsuro Nishihara, Akihiro Tamura, Takashi Ninomiya, Yutaro Omote, and Hideki Nakayama. 2020. Supervised Visual Attention for Multimodal Neural Machine Translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4304–4314, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Cite (Informal): Supervised Visual Attention for Multimodal Neural Machine Translation (Nishihara et al., COLING 2020)
- PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2020.coling-main.380.pdf
- Data: Flickr30K Entities, Flickr30k, Multi30K
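The abstract describes training visual attention with constraints from manual word-region alignments. The paper's exact loss is not given on this page, but a common way to supervise attention is to add a cross-entropy term between the predicted attention distribution and a gold alignment distribution. The sketch below illustrates that idea under that assumption; the function name and the `lambda_attn` weighting are hypothetical, not from the paper.

```python
# Minimal sketch of a supervised attention loss (an assumption: cross-entropy
# between predicted attention weights and gold word-region alignments; the
# paper's exact formulation is not shown on this page).
import numpy as np

def attention_supervision_loss(attn, gold, eps=1e-9):
    """Mean cross-entropy between predicted and gold attention distributions.

    attn : (num_words, num_regions) predicted attention weights, rows sum to 1
    gold : (num_words, num_regions) gold alignment distributions, rows sum to 1
    """
    return float(-np.sum(gold * np.log(attn + eps)) / attn.shape[0])

# Toy example: 2 source words attending over 3 image regions.
attn = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
gold = np.array([[1.0, 0.0, 0.0],   # word 0 is manually aligned to region 0
                 [0.0, 1.0, 0.0]])  # word 1 is manually aligned to region 1
loss = attention_supervision_loss(attn, gold)
```

In training, such a term would typically be added to the translation loss with a weighting coefficient, e.g. `total = nmt_loss + lambda_attn * loss`, so that attention is pulled toward the manual alignments while the model still optimizes translation quality.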