Howard University-AI4PC at SemEval-2025 Task 1: Using GPT-4o and CLIP-ViLT to Decode Figurative Language Across Text and Images

Saurav Aryal, Lawal Abdulmujeeb


Abstract
Correctly identifying idiomatic expressions remains a major challenge in Natural Language Processing (NLP), as these expressions often have meanings that cannot be directly inferred from their individual words. The SemEval-2025 Task 1 introduces two subtasks, A and B, designed to test models’ ability to interpret idioms using multimodal data, including both text and images. This paper focuses on Subtask A, where the goal is to determine which among several images best represents the intended meaning of an idiomatic expression in a given sentence. To address this, we employed a two-stage approach. First, we used GPT-4o to analyze sentences, extracting relevant keywords and sentiments to better understand the idiomatic usage. This processed information was then passed to a CLIP-VIT model, which ranked the available images based on their relevance to the idiomatic expression. Our results showed that this approach performed significantly better than directly feeding sentences and idiomatic compounds into the models without preprocessing. Specifically, our method achieved a Top-1 accuracy of 0.67 in English, whereas performance in Portuguese was notably lower at 0.23. These findings highlight both the promise of multimodal approaches for idiom interpretation and the challenges posed by language-specific differences in model performance.
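The ranking and scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vectors below are hand-written toy embeddings standing in for the outputs of a CLIP-style text and image encoder, the GPT-4o keyword-extraction stage is assumed to have already produced the text that was embedded, and the names cosine, rank_images, and top1_accuracy are hypothetical helpers introduced only for this example.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_vec: Sequence[float],
                image_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return image names ordered from most to least similar to the text."""
    return sorted(image_vecs,
                  key=lambda name: cosine(text_vec, image_vecs[name]),
                  reverse=True)


def top1_accuracy(rankings: List[List[str]], gold: List[str]) -> float:
    """Fraction of items whose top-ranked image equals the gold image."""
    hits = sum(1 for ranked, answer in zip(rankings, gold)
               if ranked[0] == answer)
    return hits / len(gold)


# Toy embeddings (assumed, not real model outputs): the text vector stands in
# for the embedded keywords produced by the first stage.
text_vec = [0.9, 0.1, 0.0]
image_vecs = {
    "literal.png": [0.1, 0.9, 0.0],
    "figurative.png": [0.8, 0.2, 0.1],
    "distractor.png": [0.0, 0.1, 0.9],
}

ranking = rank_images(text_vec, image_vecs)
print(ranking)
print(top1_accuracy([ranking], ["figurative.png"]))
```

In a real pipeline the toy vectors would be replaced by encoder outputs; the Top-1 figures in the abstract correspond to this kind of score, computed over the task's evaluation items.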
Anthology ID:
2025.semeval-1.241
Volume:
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Sara Rosenthal, Aiala Rosá, Debanjan Ghosh, Marcos Zampieri
Venues:
SemEval | WS
Publisher:
Association for Computational Linguistics
Pages:
1842–1848
URL:
https://preview.aclanthology.org/transition-to-people-yaml/2025.semeval-1.241/
Cite (ACL):
Saurav Aryal and Lawal Abdulmujeeb. 2025. Howard University-AI4PC at SemEval-2025 Task 1: Using GPT-4o and CLIP-ViLT to Decode Figurative Language Across Text and Images. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1842–1848, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Howard University-AI4PC at SemEval-2025 Task 1: Using GPT-4o and CLIP-ViLT to Decode Figurative Language Across Text and Images (Aryal & Abdulmujeeb, SemEval 2025)
PDF:
https://preview.aclanthology.org/transition-to-people-yaml/2025.semeval-1.241.pdf