Abstract
This is a system paper for the FigLang-2024 Multimodal Figurative Language Shared Task. Figurative language is generally expressed through multiple modalities, facilitating the communication of complex and abstract ideas. With the popularity of text-to-image tools, large numbers of images containing metaphor or irony are being created. The traditional task of recognizing textual entailment has been extended to understanding figurative language via visual entailment. However, existing pre-trained multimodal models in open domains often struggle with this task because it intertwines counterfactuals, human culture, and imagination. To bridge this gap, we propose FigCLIP, an end-to-end model based on CLIP and GPT-2, to identify multimodal figurative semantics and generate explanations. It employs a bidirectional fusion module with cross-attention and leverages explanations to promote the alignment of figurative image-text representations. Experimental results on the benchmark demonstrate the effectiveness of our method, achieving a 70% F1-score, 67% F1@50-score and 50% F1@60-score, outperforming GPT-4V, which has robust visual reasoning capabilities.
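The abstract describes a bidirectional fusion module with cross-attention over CLIP image and text representations. The sketch below is a minimal illustration of one way such a block could be built, assuming CLIP-style token features projected to a shared hidden size and PyTorch's standard `MultiheadAttention`; module and parameter names are illustrative and not taken from the paper's implementation.

```python
# Hypothetical sketch of a bidirectional cross-attention fusion block.
# Assumes image and text token features share one hidden size; names are illustrative.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    """Fuses image and text token sequences with two cross-attention streams:
    text queries attend to image keys/values, and image queries attend to
    text keys/values. Both streams use residual connections and LayerNorm."""

    def __init__(self, hidden_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(hidden_dim)
        self.norm_img = nn.LayerNorm(hidden_dim)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats:  (batch, text_len, hidden_dim), e.g. from a CLIP text encoder
        # image_feats: (batch, num_patches, hidden_dim), e.g. from a CLIP image encoder
        txt_attended, _ = self.txt2img(text_feats, image_feats, image_feats)
        img_attended, _ = self.img2txt(image_feats, text_feats, text_feats)
        fused_text = self.norm_txt(text_feats + txt_attended)
        fused_image = self.norm_img(image_feats + img_attended)
        return fused_text, fused_image


if __name__ == "__main__":
    fusion = BidirectionalCrossAttention(hidden_dim=512, num_heads=8)
    text = torch.randn(2, 32, 512)   # dummy text tokens
    image = torch.randn(2, 50, 512)  # dummy image patches (e.g., 7x7 grid + CLS)
    fused_text, fused_image = fusion(text, image)
    print(fused_text.shape, fused_image.shape)
```

The fused representations could then feed a classification head for the entailment label and a GPT-2 decoder for explanation generation, as the abstract suggests; the exact wiring is not specified here.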
- Anthology ID: 2024.figlang-1.13
- Volume: Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024)
- Month: June
- Year: 2024
- Address: Mexico City, Mexico (Hybrid)
- Editors: Debanjan Ghosh, Smaranda Muresan, Anna Feldman, Tuhin Chakrabarty, Emmy Liu
- Venues: Fig-Lang | WS
- Publisher: Association for Computational Linguistics
- Pages: 92–98
- URL: https://preview.aclanthology.org/remove-affiliations/2024.figlang-1.13/
- DOI: 10.18653/v1/2024.figlang-1.13
- Cite (ACL): Qihao Yang and Xuelin Wang. 2024. FigCLIP: A Generative Multimodal Model with Bidirectional Cross-attention for Understanding Figurative Language via Visual Entailment. In Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024), pages 92–98, Mexico City, Mexico (Hybrid). Association for Computational Linguistics.
- Cite (Informal): FigCLIP: A Generative Multimodal Model with Bidirectional Cross-attention for Understanding Figurative Language via Visual Entailment (Yang & Wang, Fig-Lang 2024)
- PDF: https://preview.aclanthology.org/remove-affiliations/2024.figlang-1.13.pdf