Tejomay Kishor Padole


2026

Illustrating figurative language remains challenging due to its non-literal semantics, and existing text-to-image frameworks rely heavily on proprietary models or human supervision to achieve adequate alignment. We introduce CaRVE, a lightweight and fully open-source critique-driven framework that employs VLM feedback to refine visual elaborations for figurative image generation. CaRVE bridges the semantic alignment gap even in models with fewer than 4B parameters by correcting visual and conceptual misalignments, reducing over-literalization, and improving robustness to complex figurative expressions. Using only open-source models, CaRVE achieves a 6.49% improvement over prior baselines on intrinsic automatic evaluations and a +0.37 average rank gain in human preference. We further release MetaCaRVE, an enhanced figurative image dataset constructed by refining HAIVMet using CaRVE.
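
The abstract does not spell out the refinement procedure, so the following is only a minimal sketch of what a generic critique-driven loop could look like, not CaRVE's actual implementation. All names here (`Critique`, `refine_elaboration`, the `elaborate`, `critique`, and `revise` callables, the `threshold` and `max_rounds` parameters, and the toy stand-ins) are hypothetical and introduced purely for illustration; in practice the three callables would wrap a language model and a VLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical critic output: a score in [0, 1] plus textual feedback."""
    score: float
    feedback: str


def refine_elaboration(
    phrase: str,
    elaborate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
    threshold: float = 0.8,
) -> Tuple[str, List[Tuple[str, Critique]]]:
    """Iteratively revise a visual elaboration using critic feedback.

    Returns the best-scoring elaboration and the full critique history.
    The stopping rule (score threshold or round budget) is an assumption.
    """
    current = elaborate(phrase)
    best, best_score = current, float("-inf")
    history: List[Tuple[str, Critique]] = []

    for round_idx in range(max_rounds):
        result = critique(phrase, current)
        history.append((current, result))
        if result.score > best_score:
            best, best_score = current, result.score
        # Stop when the critic is satisfied or the round budget is spent.
        if result.score >= threshold or round_idx == max_rounds - 1:
            break
        current = revise(phrase, current, result.feedback)

    return best, history


# Toy stand-ins for the model calls (assumptions, not real models).
def toy_elaborate(phrase: str) -> str:
    return "A human heart carved out of grey stone."


def toy_critique(phrase: str, elaboration: str) -> Critique:
    if "cold" in elaboration:
        return Critique(1.0, "Conveys the figurative meaning.")
    return Critique(0.3, "Over-literal: show emotional coldness instead.")


def toy_revise(phrase: str, elaboration: str, feedback: str) -> str:
    return "A person with a cold, unmoved expression ignoring someone in tears."


if __name__ == "__main__":
    final, log = refine_elaboration(
        "heart of stone", toy_elaborate, toy_critique, toy_revise
    )
    print(final)
    print(len(log), "critique round(s)")
```

In this sketch the loop keeps the highest-scoring elaboration seen so far, so a revision that the critic rates lower than an earlier draft does not replace it. Whether CaRVE uses a score threshold, a fixed number of rounds, or some other stopping rule is not stated in the abstract.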