Supervised Visual Attention for Multimodal Neural Machine Translation

Tetsuro Nishihara, Akihiro Tamura, Takashi Ninomiya, Yutaro Omote, Hideki Nakayama


Abstract
This paper proposes a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed visual attention mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset show that a Transformer-based MNMT model can be improved by incorporating our proposed supervised visual attention mechanism, and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU, +1.7 METEOR).
Anthology ID:
2020.coling-main.380
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
4304–4314
URL:
https://aclanthology.org/2020.coling-main.380
DOI:
10.18653/v1/2020.coling-main.380
Bibkey:
Cite (ACL):
Tetsuro Nishihara, Akihiro Tamura, Takashi Ninomiya, Yutaro Omote, and Hideki Nakayama. 2020. Supervised Visual Attention for Multimodal Neural Machine Translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4304–4314, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
Supervised Visual Attention for Multimodal Neural Machine Translation (Nishihara et al., COLING 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.coling-main.380.pdf
Data:
Flickr30K Entities, Flickr30k