Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

Chaoya Jiang, Wei Ye, Haiyang Xu, Songfang Huang, Fei Huang, Shikun Zhang


Abstract
In this paper, we reconsider the problem of (partial) false negative samples from the perspective of Mutual Information (MI) maximization. Traditional contrastive losses (such as the InfoNCE loss) push all negative samples away from the anchor equally, regardless of their possible semantic similarity to it. We theoretically show that the InfoNCE loss not only maximizes the MI between the anchor and positive samples but also minimizes the MI between the anchor and false negative samples, even though they share similar semantics. This offers a possible theoretical explanation for the observation that false negative samples in cross-modal contrastive learning degrade the downstream-task performance of VLP models. Motivated by this analysis, we propose a VLP model with a novel Semantic-Aware Contrastive Learning framework, named SACL, in which different negative samples are assigned different contrastive weights according to their semantic similarity to the anchor.
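To illustrate the idea of similarity-regulated contrastive weights, below is a minimal, hypothetical PyTorch sketch of an InfoNCE-style loss in which each negative's contribution to the denominator is scaled by a weight in [0, 1]. The function and argument names (e.g. similarity_regulated_info_nce, neg_weights) are illustrative assumptions, not the authors' implementation; the paper's actual weighting scheme may differ.

import torch

def similarity_regulated_info_nce(image_emb, text_emb, neg_weights, temperature=0.07):
    """image_emb, text_emb: (B, D) L2-normalized embeddings of paired images/texts.
    neg_weights: (B, B) values in [0, 1]; entry (i, j) scales how strongly text j
    is repelled from image i. Values near 0 soften the push-away for semantically
    similar (likely false) negatives; the diagonal (true positives) is ignored."""
    logits = image_emb @ text_emb.t() / temperature          # (B, B) cross-modal similarities
    B = logits.size(0)
    eye = torch.eye(B, device=logits.device)
    weights = eye + (1.0 - eye) * neg_weights                # positives keep weight 1

    def one_direction(lg, w):
        # weighted softmax denominator, stabilized by the row-wise max
        m = lg.max(dim=1, keepdim=True).values.detach()
        exp = (lg - m).exp()
        denom = (w * exp).sum(dim=1)
        pos = exp.diagonal()                                 # exp(s_ii / tau - m_i)
        return -(pos / denom).log().mean()

    # symmetric image-to-text and text-to-image terms
    return 0.5 * (one_direction(logits, weights) + one_direction(logits.t(), weights.t()))

In this sketch, neg_weights could, for example, be derived from a separate (e.g. momentum or teacher) model's cross-modal similarity estimates, with more similar pairs receiving smaller weights; that choice is an assumption for illustration only.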
Anthology ID:
2023.acl-long.819
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
14660–14679
URL:
https://aclanthology.org/2023.acl-long.819
DOI:
10.18653/v1/2023.acl-long.819
Cite (ACL):
Chaoya Jiang, Wei Ye, Haiyang Xu, Songfang Huang, Fei Huang, and Shikun Zhang. 2023. Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14660–14679, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation (Jiang et al., ACL 2023)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2023.acl-long.819.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-2/2023.acl-long.819.mp4