RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation

Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Dani Lischinski, Idan Szpektor


Abstract
Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity of a referenced subject image. Despite its broad downstream applicability—ranging from enhanced personalization in image generation to consistent character representation in video rendering—progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving gains of up to 6.4 points in textual alignment and 5.9 points in subject preservation.
Anthology ID:
2025.findings-emnlp.447
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8420–8438
URL:
https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.447/
DOI:
10.18653/v1/2025.findings-emnlp.447
Cite (ACL):
Aviv Slobodkin, Hagai Taitelbaum, Yonatan Bitton, Brian Gordon, Michal Sokolik, Nitzan Bitton Guetta, Almog Gueta, Royi Rassin, Dani Lischinski, and Idan Szpektor. 2025. RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8420–8438, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation (Slobodkin et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.447.pdf
Checklist:
2025.findings-emnlp.447.checklist.pdf