Abstract
Multimodal ambiguity is a challenge for understanding text and images. Large pre-trained models have already reached a high level of quality. This paper presents an implementation for solving an image disambiguation task that relies solely on the knowledge captured in multimodal and language models. In Task 1 of SemEval-2023 (Visual Word Sense Disambiguation), this approach achieved an MRR of 0.738 using CLIP-Large and the OPT model for text generation. Applying a generative model to create additional text for a phrase containing an ambiguous word improves our results. The performance gain from a bigger language model is larger than the performance gain from using the larger CLIP model.
- Anthology ID:
- 2023.semeval-1.18
- Volume:
- Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Atul Kr. Ojha, A. Seza Doğruöz, Giovanni Da San Martino, Harish Tayyar Madabushi, Ritesh Kumar, Elisa Sartori
- Venue:
- SemEval
- SIG:
- SIGLEX
- Publisher:
- Association for Computational Linguistics
- Pages:
- 130–135
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2023.semeval-1.18/
- DOI:
- 10.18653/v1/2023.semeval-1.18
- Cite (ACL):
- Sebastian Diem, Chan Jong Im, and Thomas Mandl. 2023. University of Hildesheim at SemEval-2023 Task 1: Combining Pre-trained Multimodal and Generative Models for Image Disambiguation. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 130–135, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- University of Hildesheim at SemEval-2023 Task 1: Combining Pre-trained Multimodal and Generative Models for Image Disambiguation (Diem et al., SemEval 2023)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2023.semeval-1.18.pdf
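The abstract reports results in terms of MRR (mean reciprocal rank) over candidate images scored against a text phrase. As a minimal illustrative sketch, not the authors' implementation, the ranking-by-similarity step and the MRR metric can be expressed as follows; the embeddings here are placeholder vectors standing in for real CLIP text/image features:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(text_emb, image_embs):
    # Return candidate image indices sorted by descending similarity
    # to the (possibly generation-augmented) text embedding.
    scores = [cosine(text_emb, e) for e in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: -scores[i])

def mean_reciprocal_rank(rankings, gold):
    # rankings: one ranked index list per query; gold: the correct index per query.
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)
```

For example, if the gold image is ranked first for one query and second for another, the MRR is (1 + 1/2) / 2 = 0.75, close to the 0.738 reported in the paper.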