Entity-level Cross-modal Learning Improves Multi-modal Machine Translation

Xin Huang, Jiajun Zhang, Chengqing Zong


Abstract
Multi-modal machine translation (MMT) aims to improve translation performance by incorporating visual information. Most studies leverage visual information either by integrating global image features as auxiliary input or by attending to relevant local regions of the image during decoding. However, these uses of visual information make it difficult to determine how the visual modality helps and why it works. Inspired by the findings of (CITATION) that entities are the most informative elements in an image, we propose an explicit entity-level cross-modal learning approach that aims to augment entity representations. Specifically, the approach is framed as a reconstruction task that recovers the original textual input from a multi-modal input in which entities are replaced with visual features. A multi-task framework then combines the translation task with the reconstruction task to make full use of cross-modal entity representation learning. Extensive experiments demonstrate that our approach achieves comparable or even better performance than state-of-the-art models. Furthermore, our in-depth analysis shows how visual information improves translation.
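The following is a minimal PyTorch sketch, not the authors' released code, of the multi-task objective the abstract describes: entity positions in the source are swapped for projected visual region features, a reconstruction head recovers the original textual input, and the reconstruction loss is combined with the translation loss. All module names, dimensions, and the balancing weight alpha are illustrative assumptions.

import torch
import torch.nn as nn

class EntityCrossModalMMT(nn.Module):
    def __init__(self, vocab_size, d_model=512, d_visual=2048, alpha=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.visual_proj = nn.Linear(d_visual, d_model)  # map region features into the text space
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.mt_head = nn.Linear(d_model, vocab_size)   # translation logits
        self.rec_head = nn.Linear(d_model, vocab_size)  # reconstruction logits
        self.alpha = alpha                              # task-balancing weight (assumed)
        self.ce = nn.CrossEntropyLoss(ignore_index=0)

    def forward(self, src_ids, entity_mask, region_feats, tgt_ids):
        # src_ids:      (B, S) source token ids (0 = pad)
        # entity_mask:  (B, S) bool, True where the token is a visual entity
        # region_feats: (B, S, d_visual) region features aligned to entity tokens
        # tgt_ids:      (B, T) target token ids, starting with BOS
        x = self.embed(src_ids)
        # Multi-modal input: replace entity embeddings with projected visual features.
        x = torch.where(entity_mask.unsqueeze(-1), self.visual_proj(region_feats), x)
        memory = self.encoder(x)

        # Reconstruction task: recover the original textual input from the
        # multi-modal encoding, so entity representations must carry the
        # information of the replaced words.
        loss_rec = self.ce(self.rec_head(memory).transpose(1, 2), src_ids)

        # Translation task: standard teacher-forced decoding over the same encoder.
        tgt_in, tgt_out = tgt_ids[:, :-1], tgt_ids[:, 1:]
        causal = torch.triu(
            torch.full((tgt_in.size(1), tgt_in.size(1)), float("-inf")), diagonal=1
        )
        dec = self.decoder(self.embed(tgt_in), memory, tgt_mask=causal)
        loss_mt = self.ce(self.mt_head(dec).transpose(1, 2), tgt_out)

        return loss_mt + self.alpha * loss_rec

Sharing one encoder between the two tasks is the point of the sketch: the reconstruction signal shapes the same entity representations the translation decoder attends to, which is one plausible reading of how the paper's multi-task framework operates.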
Anthology ID:
2021.findings-emnlp.92
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
1067–1080
URL:
https://aclanthology.org/2021.findings-emnlp.92
DOI:
10.18653/v1/2021.findings-emnlp.92
Cite (ACL):
Xin Huang, Jiajun Zhang, and Chengqing Zong. 2021. Entity-level Cross-modal Learning Improves Multi-modal Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1067–1080, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Entity-level Cross-modal Learning Improves Multi-modal Machine Translation (Huang et al., Findings 2021)
PDF:
https://preview.aclanthology.org/landing_page/2021.findings-emnlp.92.pdf
Video:
https://preview.aclanthology.org/landing_page/2021.findings-emnlp.92.mp4
Data
Flickr30K Entities, Flickr30k