Juntuo Wang
2026
From Mimesis to Metamorphosis: Evolving VLM Judges via In-Context Comparing and Knowledge Internalization
Juntuo Wang | Yuming Qiao | Yifan Yang | Lunxi Yuan | Liang Luo | Dan Meng
Findings of the Association for Computational Linguistics: ACL 2026
Vision-language models (VLMs) are increasingly adopted as judges for subjective assessment, yet absolute scoring remains brittle due to inconsistent scales and inherent preference biases. To bridge this gap, we propose S2AD (**Semantic-Anchored Scale-Agnostic Distillation**), a novel easy-to-hard framework that operationalizes subjective assessment as comparative analysis, conceptualizing the judge’s evolution from mimesis to metamorphosis. In Stage 1 (Mimesis), we introduce Dynamic Soft Positioning (DSP) to train the judge to compare a query against retrieved reference images, establishing a relative evaluation space that ensures consistent ordering under heterogeneous scales. In Stage 2 (Metamorphosis), this comparative capability is internalized via Language Buttons—discrete semantic levels serving as a retrieval-free internal reference. Optimized with Group Relative Policy Optimization (GRPO), S2AD achieves efficient, scale-steerable inference that adapts to diverse grading standards. Our framework reaches state-of-the-art performance across multiple benchmarks, validating the effectiveness of internalized comparative priors for robust, rank-invariant, and scale-steerable evaluation. The code is available at: https://github.com/SpatialVision-Research/SSAD_ACL2026_Findings.
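The abstract names Group Relative Policy Optimization (GRPO) as the optimizer but gives no formulas. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is generally known for: each sampled response's reward is standardized against the other responses drawn for the same input. The function name, the `eps` constant, and the reward values are hypothetical, and nothing here is taken from the S2AD code or reflects its actual reward design.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Assumption: this is the generic GRPO-style normalization,
    not the reward or advantage definition used by S2AD.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled judgments of one query image.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

In this toy group, the judgment with the highest reward receives a positive advantage, the lowest a negative one, and the two middle judgments receive zero, so the policy update depends only on how responses compare within the group.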