Do Image–Text Metrics Respect Semantic Invariances?

Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, M. Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, Dan Roth


Abstract
Reference-free image-to-text evaluators are now standard for scoring image–caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe of five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities: benign spatial edits and simple phrasing changes shift scores by approximately 6–9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to roughly 37% of cases, particularly under spatial changes. A small human study supports this finding: annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.
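The two quantities the abstract reports, the average relative score shift under a perturbation and the rate of ranking flips between two systems, can be illustrated with a minimal sketch. The definitions below are assumptions made for illustration only: the abstract does not give the paper's exact formulas, and the function names (`relative_shift`, `mean_relative_shift`, `ranking_flip_rate`) and the toy scores are hypothetical.

```python
from statistics import mean


def relative_shift(original: float, perturbed: float) -> float:
    """Absolute score change as a fraction of the original score.

    Assumed definition; the paper may normalize differently.
    """
    if original == 0:
        raise ValueError("original score must be non-zero")
    return abs(perturbed - original) / abs(original)


def mean_relative_shift(pairs):
    """Average relative shift over (original, perturbed) score pairs."""
    return mean(relative_shift(o, p) for o, p in pairs)


def ranking_flip_rate(scores_a, scores_b, perturbed_a, perturbed_b):
    """Fraction of items where the A-vs-B ordering reverses after perturbation.

    Items that are tied before the perturbation are skipped, since they
    have no ordering to flip.
    """
    flips = 0
    total = 0
    for a, b, pa, pb in zip(scores_a, scores_b, perturbed_a, perturbed_b):
        before = a - b
        after = pa - pb
        if before == 0:
            continue
        total += 1
        if before * after < 0:  # sign change means the ranking reversed
            flips += 1
    return flips / total if total else 0.0


# Toy, made-up scores for two hypothetical captioning systems A and B.
orig_a, orig_b = [0.80, 0.70], [0.79, 0.60]
pert_a, pert_b = [0.75, 0.71], [0.78, 0.62]

shift_a = mean_relative_shift(list(zip(orig_a, pert_a)))
flip_rate = ranking_flip_rate(orig_a, orig_b, pert_a, pert_b)
```

In this toy data, system A leads B on the first item before the perturbation and trails it afterwards, so one of the two items counts as a flip. A metric that respected the invariance would leave both quantities near zero.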
Anthology ID:
2026.findings-acl.1948
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
39089–39116
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1948/
Cite (ACL):
Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, M. Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, and Dan Roth. 2026. Do Image–Text Metrics Respect Semantic Invariances? In Findings of the Association for Computational Linguistics: ACL 2026, pages 39089–39116, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Do Image–Text Metrics Respect Semantic Invariances? (Agarwal et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1948.pdf
Checklist:
 2026.findings-acl.1948.checklist.pdf