Feature-level Incongruence Reduction for Multimodal Translation

Zhifeng Li, Yu Hong, Yuchen Pan, Jian Tang, Jianmin Yao, Guodong Zhou


Abstract
Caption translation aims to translate image annotations (captions for short). Recently, Multimodal Neural Machine Translation (MNMT) has been explored as the essential solution. Besides the linguistic features in captions, MNMT allows visual (image) features to be used. The integration of multimodal features reinforces the semantic representation and considerably improves translation performance. However, MNMT suffers from incongruence between visual and linguistic features. To overcome this problem, we propose extending the MNMT architecture with a harmonization network, which harmonizes multimodal features (linguistic and visual features) by unidirectional modal space conversion. It enables multimodal translation to be carried out in a seemingly monomodal translation pipeline. We experiment on the gold-standard Multi30k-16 and Multi30k-17. Experimental results show that, compared to the baseline, the proposed method yields improvements of up to 2.2% BLEU for translating English captions into German (En→De), 7.6% for English-to-French translation (En→Fr), and 1.5% for English-to-Czech translation (En→Cz). Using the harmonization network yields performance competitive with the state of the art.
Anthology ID:
2021.alvr-1.1
Volume:
Proceedings of the Second Workshop on Advances in Language and Vision Research
Month:
June
Year:
2021
Address:
Online
Editors:
Xin, Ronghang Hu, Drew Hudson, Tsu-Jui Fu, Marcus Rohrbach, Daniel Fried
Venue:
ALVR
Publisher:
Association for Computational Linguistics
Pages:
1–10
URL:
https://aclanthology.org/2021.alvr-1.1
DOI:
10.18653/v1/2021.alvr-1.1
Cite (ACL):
Zhifeng Li, Yu Hong, Yuchen Pan, Jian Tang, Jianmin Yao, and Guodong Zhou. 2021. Feature-level Incongruence Reduction for Multimodal Translation. In Proceedings of the Second Workshop on Advances in Language and Vision Research, pages 1–10, Online. Association for Computational Linguistics.
Cite (Informal):
Feature-level Incongruence Reduction for Multimodal Translation (Li et al., ALVR 2021)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2021.alvr-1.1.pdf
Data
MS COCO, Multi30K