Now They See It, Now They Don’t: Multimodal Reward Models Exhibit Unreliability in Physical World Constraints

Sadaf Ghaffari, Nikhil Krishnaswamy


Abstract
Generative AI systems, especially those driven by autoregressive and diffusion-based models, are known to struggle with spatial reasoning. As such, it becomes critical to understand how humans regard those failure modes. In this paper, we examine how humans judge different types of errors in images generated by a text-to-image model. We curated prompts that described common household objects with variance in number, spatial relations, and orientations, and generated a variety of images using each prompt. Humans observed pairs of images generated using the same prompt and answered a set of systematic questions about each image. Survey results showed that incorrect spatial *orientation* regularly emerges as a reason that the generated images do not accurately represent the prompt. We further investigated how RLHF-based multimodal reward models score prompt-image alignment over the same data, and whether they can reliably distinguish the better image in a pairwise setting, as humans do. We find that even though a general cross-task reward model may output alignment scores that accord with those of humans, its reasoning traces are flawed with respect to spatial orientational and relational indicators—the very factors that human annotators rated as the most consequential errors in generated images. Our results show that human annotators regard spatial reasoning errors as highly impactful on the correctness of generated images, and undermine the reliability of multimodal reward model scores as a baseline for evaluating image quality.
Anthology ID:
2026.conll-main.20
Volume:
Proceedings of the 30th Conference on Computational Natural Language Learning
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Claire Bonial, Yevgeni Berzak
Venues:
CoNLL | WS
Publisher:
Association for Computational Linguistics
Pages:
344–357
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.20/
Cite (ACL):
Sadaf Ghaffari and Nikhil Krishnaswamy. 2026. Now They See It, Now They Don’t: Multimodal Reward Models Exhibit Unreliability in Physical World Constraints. In Proceedings of the 30th Conference on Computational Natural Language Learning, pages 344–357, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Now They See It, Now They Don’t: Multimodal Reward Models Exhibit Unreliability in Physical World Constraints (Ghaffari & Krishnaswamy, CoNLL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.conll-main.20.pdf