Do Image–Text Metrics Respect Semantic Invariances?
Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, M. Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, Dan Roth
Abstract
Reference-free image-to-text evaluators are now standard for scoring image–caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities: benign spatial edits and simple phrasing changes shift scores by ≈6–9% on average, and for systems separated by just 0.7% these shifts can cause ranking flips in up to ∼37% of cases, particularly under spatial changes. A small human study supports this finding: annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.
- Anthology ID:
- 2026.findings-acl.1948
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 39089–39116
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1948/
- Cite (ACL):
- Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, M. Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, and Dan Roth. 2026. Do Image–Text Metrics Respect Semantic Invariances?. In Findings of the Association for Computational Linguistics: ACL 2026, pages 39089–39116, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Do Image–Text Metrics Respect Semantic Invariances? (Agarwal et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1948.pdf