Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

Tong Zhang, Kuofeng Gao, Jiawang Bai, Leo Yu Zhang, Xin Yin, Zonghui Wang, Shouling Ji, Wenzhi Chen


Abstract
Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are vulnerable to targeted data poisoning and backdoor attacks because their massive training sets of image-caption pairs are crawled from the Internet. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, the matching process relies solely on global representations of images and captions and overlooks their fine-grained visual and textual features, so it may introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose an optimal transport-based framework for reconstructing image-caption pairs, named OTCCLIP. We introduce a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign new captions based on this distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment through optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP successfully reduces the attack success rates of poisoning attacks to 0% in most cases. Moreover, compared to previous methods, OTCCLIP significantly improves the zero-shot and linear-probing performance of CLIP models trained on poisoned datasets.
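
As a rough illustration of the caption re-assignment step described in the abstract, below is a minimal sketch of an entropic-regularized (Sinkhorn) optimal transport distance between fine-grained feature sets, used to re-pair each image with its closest caption. This is not the paper's implementation: the function names, hyperparameters, uniform mass assumption, and the random stand-in features (in place of real CLIP patch/token embeddings) are all illustrative assumptions.

```python
# Sketch: OT-based image-caption matching via Sinkhorn iterations over a
# cosine cost between fine-grained feature sets. Feature extraction is
# stubbed with random arrays; all names/values here are assumptions.
import numpy as np

def sinkhorn_ot_distance(X, Y, eps=0.1, n_iters=100):
    """Entropic-regularized OT distance between feature sets X (m,d), Y (n,d)."""
    # Cosine cost: 1 - similarity between L2-normalized features.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - X @ Y.T                          # (m, n) cost matrix
    a = np.full(X.shape[0], 1.0 / X.shape[0])  # uniform mass on image patches
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])  # uniform mass on text tokens
    K = np.exp(-C / eps)                       # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                   # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # transport plan
    return float((P * C).sum())

def reassign_captions(image_feats, caption_feats):
    """Re-pair each image with the caption minimizing the OT distance."""
    D = np.array([[sinkhorn_ot_distance(img, cap) for cap in caption_feats]
                  for img in image_feats])
    return D.argmin(axis=1)                    # best caption index per image

# Toy usage: 4 images with 49 patch features, 4 captions with 12 token features.
rng = np.random.default_rng(0)
imgs = [rng.normal(size=(49, 64)) for _ in range(4)]
caps = [rng.normal(size=(12, 64)) for _ in range(4)]
print(reassign_captions(imgs, caps))
```

Set-to-set transport of this kind is what lets the matching account for fine-grained features rather than a single global embedding per image or caption; the paper's actual distance and alignment objectives may differ in detail.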
Anthology ID:
2025.emnlp-main.497
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
9836–9849
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.497/
Cite (ACL):
Tong Zhang, Kuofeng Gao, Jiawang Bai, Leo Yu Zhang, Xin Yin, Zonghui Wang, Shouling Ji, and Wenzhi Chen. 2025. Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 9836–9849, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.497.pdf
Checklist:
2025.emnlp-main.497.checklist.pdf