Multimodal Transformer for Multimodal Machine Translation

Shaowei Yao, Xiaojun Wan

Abstract
Multimodal Machine Translation (MMT) aims to introduce information from other modalities, typically static images, to improve translation quality. Previous works propose various incorporation methods, but most do not consider the relative importance of the different modalities. Treating all modalities equally may encode too much useless information from less important modalities. In this paper, we introduce multimodal self-attention in the Transformer to address these issues in MMT. The proposed method learns the representation of images based on the text, which avoids encoding irrelevant information from the images. Experiments and visualization analysis demonstrate that our model benefits from visual information and substantially outperforms previous works and competitive baselines on various metrics.
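The multimodal self-attention described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes that text hidden states act as the attention queries while keys and values come from the concatenation of text and projected image features, so the image representation that gets mixed in is conditioned on the text. The 2048-dimensional image features, the module name, and all variable names are hypothetical choices made only for illustration.

import torch
import torch.nn as nn

class MultimodalSelfAttention(nn.Module):
    """Sketch of text-conditioned attention over concatenated text + image features."""

    def __init__(self, d_model: int, n_heads: int = 8, img_dim: int = 2048):
        super().__init__()
        # Project image region features into the text embedding space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, src_len, d_model)   source-word hidden states
        # image: (batch, n_regions, img_dim) pre-extracted visual features
        memory = torch.cat([text, self.img_proj(image)], dim=1)
        # Queries come from the text only, so each text position decides how
        # much visual information to absorb, as described in the abstract.
        out, _ = self.attn(query=text, key=memory, value=memory)
        return out

# Toy usage with random tensors (shapes are illustrative only).
layer = MultimodalSelfAttention(d_model=512)
out = layer(torch.randn(2, 10, 512), torch.randn(2, 49, 2048))
print(out.shape)  # torch.Size([2, 10, 512])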
Anthology ID:
2020.acl-main.400
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4346–4350
URL:
https://aclanthology.org/2020.acl-main.400
DOI:
10.18653/v1/2020.acl-main.400
Cite (ACL):
Shaowei Yao and Xiaojun Wan. 2020. Multimodal Transformer for Multimodal Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4346–4350, Online. Association for Computational Linguistics.
Cite (Informal):
Multimodal Transformer for Multimodal Machine Translation (Yao & Wan, ACL 2020)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2020.acl-main.400.pdf
Video:
http://slideslive.com/38929440
Data
Multi30K