FigCLIP: A Generative Multimodal Model with Bidirectional Cross-attention for Understanding Figurative Language via Visual Entailment

Qihao Yang, Xuelin Wang


Abstract
This is a system paper for the FigLang-2024 Multimodal Figurative Language Shared Task. Figurative language is often expressed across multiple modalities, facilitating the communication of complex and abstract ideas. With the growing popularity of text-to-image tools, large numbers of images containing metaphor or irony are being created. The traditional task of recognizing textual entailment has accordingly been extended to understanding figurative language via visual entailment. However, existing pre-trained multimodal models in open domains often struggle with this task because counterfactuals, human culture, and imagination are intertwined in figurative content. To bridge this gap, we propose FigCLIP, an end-to-end model based on CLIP and GPT-2, to identify multimodal figurative semantics and generate explanations. It employs a bidirectional fusion module with cross-attention and leverages explanations to promote the alignment of figurative image-text representations. Experimental results on the benchmark demonstrate the effectiveness of our method, which achieves an F1 score of 70%, an F1@50 of 67%, and an F1@60 of 50%, outperforming GPT-4V, which has robust visual reasoning capabilities.
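To make the architectural idea concrete, below is a minimal PyTorch sketch of a bidirectional cross-attention fusion module of the kind the abstract describes: text tokens attend over image tokens and vice versa, and the fused features feed an entailment head. All module names, dimensions, and pooling choices here are hypothetical illustrations, not the authors' implementation; see the paper for the actual FigCLIP design.

import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    """Illustrative sketch (not the paper's code): fuse CLIP-style image
    and text token features with two cross-attention directions."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # text queries attend over image tokens
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # image queries attend over text tokens
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_i = nn.LayerNorm(dim)
        # binary entailment head over the pooled fused features
        self.classifier = nn.Linear(2 * dim, 2)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, D); image_feats: (B, P, D)
        t, _ = self.txt2img(text_feats, image_feats, image_feats)
        i, _ = self.img2txt(image_feats, text_feats, text_feats)
        t = self.norm_t(text_feats + t)  # residual + layer norm
        i = self.norm_i(image_feats + i)
        fused = torch.cat([t.mean(dim=1), i.mean(dim=1)], dim=-1)
        # the fused text tokens `t` could also serve as a prefix to
        # condition a GPT-2 decoder that generates explanations
        return self.classifier(fused), t

if __name__ == "__main__":
    fusion = BidirectionalCrossAttentionFusion()
    logits, prefix = fusion(torch.randn(2, 16, 512), torch.randn(2, 50, 512))
    print(logits.shape)  # torch.Size([2, 2])

In this sketch the entailment logits come from mean-pooled fused features, while the cross-attended text tokens are returned so a generative decoder could produce the explanation, mirroring the abstract's use of explanations to align figurative image-text representations.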
Anthology ID:
2024.figlang-1.13
Volume:
Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico (Hybrid)
Editors:
Debanjan Ghosh, Smaranda Muresan, Anna Feldman, Tuhin Chakrabarty, Emmy Liu
Venues:
Fig-Lang | WS
Publisher:
Association for Computational Linguistics
Pages:
92–98
URL:
https://aclanthology.org/2024.figlang-1.13
DOI:
10.18653/v1/2024.figlang-1.13
Cite (ACL):
Qihao Yang and Xuelin Wang. 2024. FigCLIP: A Generative Multimodal Model with Bidirectional Cross-attention for Understanding Figurative Language via Visual Entailment. In Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024), pages 92–98, Mexico City, Mexico (Hybrid). Association for Computational Linguistics.
Cite (Informal):
FigCLIP: A Generative Multimodal Model with Bidirectional Cross-attention for Understanding Figurative Language via Visual Entailment (Yang & Wang, Fig-Lang-WS 2024)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2024.figlang-1.13.pdf