Siru Miao

2026

Existing cross-modal image-text retrieval models often retrieve samples with inconsistent details. To evaluate fine-grained discriminability, we introduce MSCOCO-CCD and Flickr30k-CCD, which have three key features: (1) a two-level image content taxonomy for contrastive sample generation and fine-grained evaluation; (2) annotations of numerous contrastive samples, where each sample differs from the anchor by a controlled contrastive difference (CCD) and the specific type of difference is labeled; (3) a fine-grained contrastive discrimination metric that assesses the ability to distinguish fine-grained nuances. Extensive experiments demonstrate that contrastive samples can significantly degrade retrieval performance. Furthermore, fine-grained evaluation reveals that current models still struggle to produce discriminative representations for certain feature types, such as entity emotion and scene attribute. Our datasets and related code will be publicly released.

Co-authors

Yating Yang 1

Xi Zhou 1

Venues

Findings 1
