Leveraging Visual Scene Graph to Enhance Translation Quality in Multimodal Machine Translation

Ali Hatami, Mihael Arcan, Paul Buitelaar


Abstract
Despite significant advances in Multimodal Machine Translation, understanding and effectively utilising visual scenes within multimodal models remains a complex challenge. Extracting comprehensive and relevant visual features requires extensive and detailed input data so that the model accurately captures the objects, attributes, and relationships within a scene. In this paper, we explore using visual scene graphs extracted from images to enhance the performance of translation models. We investigate how to integrate Visual Scene Graph information into translation models, representing it as a semantic structure rather than relying on raw image data. We evaluate our approach on the Multi30K dataset for translation from English into German, French, and Czech using the BLEU, chrF2, TER and COMET metrics. Our results demonstrate that utilising visual scene graph information improves translation performance: information on semantic structure can improve on the multimodal baseline model, leading to better contextual understanding and translation accuracy.
Anthology ID:
2025.mtsummit-1.27
Volume:
Proceedings of Machine Translation Summit XX: Volume 1
Month:
June
Year:
2025
Address:
Geneva, Switzerland
Editors:
Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, Sara Szoc
Venue:
MTSummit
Publisher:
European Association for Machine Translation
Pages:
353–364
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.27/
Cite (ACL):
Ali Hatami, Mihael Arcan, and Paul Buitelaar. 2025. Leveraging Visual Scene Graph to Enhance Translation Quality in Multimodal Machine Translation. In Proceedings of Machine Translation Summit XX: Volume 1, pages 353–364, Geneva, Switzerland. European Association for Machine Translation.
Cite (Informal):
Leveraging Visual Scene Graph to Enhance Translation Quality in Multimodal Machine Translation (Hatami et al., MTSummit 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.27.pdf