Siru Miao
2026
Benchmarking the Fine-Grained Discriminability in Image-Text Retrieval via Controlled Contrastive Differences
Zhen Wang | Xi Zhou | Yating Yang | Bo Ma | Lei Wang | Rui Dong | Azmat Anwar | Siru Miao
Findings of the Association for Computational Linguistics: ACL 2026
Existing cross-modal image-text retrieval models often retrieve samples whose details are inconsistent with the query. To evaluate fine-grained discriminability, we introduce MSCOCO-CCD and Flickr30k-CCD, which have three key features: (1) a two-level image content taxonomy for contrastive sample generation and fine-grained evaluation; (2) annotation of numerous contrastive samples, where each sample differs from the anchor by a controlled contrastive difference (CCD), with the specific type of difference labeled; (3) a fine-grained contrastive discrimination metric that assesses the ability to distinguish fine-grained nuances. Extensive experiments demonstrate that contrastive samples can significantly degrade retrieval performance. Furthermore, fine-grained evaluation reveals that current models still struggle to produce discriminative representations for certain feature types, such as entity emotion and scene attribute. Our datasets and code will be publicly released.