RelCLIP: Adapting Language-Image Pretraining for Visual Relationship Detection via Relational Contrastive Learning

Yi Zhu, Zhaoqing Zhu, Bingqian Lin, Xiaodan Liang, Feng Zhao, Jianzhuang Liu


Abstract
Conventional visual relationship detection models use only the numeric IDs of relation labels for training and ignore the semantic correlation between the labels, which leads to severe training biases and harms the generalization ability of the learned representations. In this paper, we introduce compact language information of relation labels for regularizing the representation learning of visual relations. Specifically, we propose a simple yet effective visual Relationship prediction framework, termed RelCLIP, that transfers natural language knowledge learned from Contrastive Language-Image Pre-training (CLIP) models to enhance relationship prediction. Benefiting from the powerful visual-semantic alignment ability of CLIP at the image level, we introduce a novel Relational Contrastive Learning (RCL) approach that explores relation-level visual-semantic alignment by learning to match cross-modal relational embeddings. By collaboratively learning the semantic coherence and discrepancy from relation triplets, the model can generate more discriminative and robust representations. Experimental results on the Visual Genome dataset show that RelCLIP achieves significant improvements over strong baselines under both full supervision (accurate labels) and distant supervision (noisy labels), demonstrating its powerful generalization ability in learning relationship representations. Code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/RelCLIP.
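The relation-level alignment described in RCL follows the same contrastive recipe that CLIP applies at the image level: matched visual and textual relation embeddings are pulled together while mismatched pairs in the batch are pushed apart. Below is a minimal PyTorch-style sketch of such a symmetric contrastive objective; the function name, temperature value, and prompt format are illustrative assumptions (the released implementation is in MindSpore), not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def relational_contrastive_loss(rel_visual_emb, rel_text_emb, temperature=0.07):
    """
    Symmetric InfoNCE-style loss over a batch of relation triplets.

    rel_visual_emb: (B, D) visual embeddings of subject-predicate-object pairs
    rel_text_emb:   (B, D) text embeddings of the corresponding relation labels
                    (e.g., encoded by a CLIP text encoder from a prompt such as
                     "a photo of a person riding a horse")

    Names, the temperature value, and the prompt format are illustrative
    assumptions, not the paper's exact configuration.
    """
    v = F.normalize(rel_visual_emb, dim=-1)
    t = F.normalize(rel_text_emb, dim=-1)

    # Cosine-similarity logits between every visual relation and every text relation.
    logits = v @ t.t() / temperature              # (B, B)
    targets = torch.arange(v.size(0), device=v.device)

    # Matched (visual, text) pairs on the diagonal are positives; all others are negatives.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```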
Anthology ID:
2022.emnlp-main.317
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4800–4810
URL:
https://aclanthology.org/2022.emnlp-main.317
DOI:
10.18653/v1/2022.emnlp-main.317
Cite (ACL):
Yi Zhu, Zhaoqing Zhu, Bingqian Lin, Xiaodan Liang, Feng Zhao, and Jianzhuang Liu. 2022. RelCLIP: Adapting Language-Image Pretraining for Visual Relationship Detection via Relational Contrastive Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4800–4810, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
RelCLIP: Adapting Language-Image Pretraining for Visual Relationship Detection via Relational Contrastive Learning (Zhu et al., EMNLP 2022)
PDF:
https://preview.aclanthology.org/add_acl24_videos/2022.emnlp-main.317.pdf