REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

FuLin Shi, Wenyi Xiao, Leilei Gan, Liang Ding, Binchen


Abstract
Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static Question Answering (QA) pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a reinforcement-guided visual reasoning framework for element-level text-to-image alignment evaluation. Adopting a structured "grounding–reasoning–conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a multi-dimensional reward function that targets format compliance, localization precision, and alignment accuracy. Extensive experiments confirm that REVEALER achieves state-of-the-art results across four benchmarks. Notably, on EvalMuse-40K, it surpasses the strong proprietary Gemini 3 Pro and training-based baselines with absolute accuracy gains of +4.2% and +13.3%, respectively. Ablation studies further show that the components of our method together contribute a cumulative 19.6% improvement over the base model.
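As a rough illustration of how a multi-dimensional reward of this kind could feed GRPO, the sketch below combines a format check, a box-overlap (IoU) score, and a label-accuracy term, then normalizes rewards within a group of sampled responses. It is a minimal sketch under stated assumptions, not the authors' implementation: the tag layout (`<grounding>`, `<reasoning>`, `<conclusion>`), the yes/no labels, the weights, and the function names `reward` and `group_relative_advantages` are all hypothetical; the abstract does not specify them.

```python
import re
from statistics import mean, pstdev

# Hypothetical response layout (an assumption, not taken from the paper):
# <grounding>[x1, y1, x2, y2]</grounding><reasoning>...</reasoning><conclusion>yes|no</conclusion>
PATTERN = re.compile(
    r"^<grounding>(.*?)</grounding>\s*"
    r"<reasoning>(.*?)</reasoning>\s*"
    r"<conclusion>(yes|no)</conclusion>$",
    re.S,
)


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def reward(response, gold_box, gold_label, weights=(0.2, 0.3, 0.5)):
    """Weighted sum of format, localization, and accuracy terms.

    The weights are illustrative placeholders, not values from the paper.
    """
    match = PATTERN.match(response.strip())
    if match is None:
        return 0.0  # malformed output earns no reward in this sketch
    numbers = re.findall(r"-?\d+(?:\.\d+)?", match.group(1))
    if len(numbers) >= 4:
        r_loc = iou(tuple(float(n) for n in numbers[:4]), gold_box)
    else:
        r_loc = 0.0
    r_acc = 1.0 if match.group(3) == gold_label else 0.0
    w_fmt, w_loc, w_acc = weights
    return w_fmt * 1.0 + w_loc * r_loc + w_acc * r_acc


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In use, one would sample several responses for the same prompt-image-element triple, score each with `reward`, and pass the resulting list to `group_relative_advantages` so that responses are compared only against others from the same group.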
Anthology ID:
2026.acl-long.2200
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
47630–47649
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2200/
Cite (ACL):
FuLin Shi, Wenyi Xiao, Leilei Gan, Liang Ding, and Binchen. 2026. REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 47630–47649, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation (Shi et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.2200.pdf
Checklist:
2026.acl-long.2200.checklist.pdf