Wei He

Other people with similar names: Wei He

Unverified author pages with similar names: Wei He


2026

Vision-Language Models (VLMs) excel at photorealistic generation, yet often struggle to represent abstract meaning such as idiomatic interpretations of noun compounds. To study whether high visual fidelity interferes with idiomatic compositionality under visual abstraction, we introduce DIVA, a controlled benchmark that replaces high-fidelity visual detail with schematic iconicity by generating paired, sense-anchored visualizations for literal and idiomatic readings.We further propose Semantic Alignment Gap (𝛥), an architecture-agnostic metric that quantifies divergence between literal and idiomatic visual grounding.We additionally introduce a directional signed bias b(t) to separately measure the direction and strength of literal preference.Evaluating 8 recent VLMs, we reveal a consistent Literal Superiority Bias: model scale alone does not resolve literal preference, and increased visual fidelity is associated with weaker symbolic alignment, suggesting cognitive interference from hyper-realistic imagery. Our findings suggest that improving compositional understanding requires iconographic abstraction of visual input and anchoring interpretation and generation in intended meaning.
Idiomatic expressions present a unique chal-lenge in NLP, as their meanings are often notdirectly inferable from their constituent words.Despite recent advancements in large languagemodels, idiomaticity remains a significant ob-stacle to robust semantic representation. Wepresent datasets and task results for MWE-2026 Shared Task 2: Advancing MultimodalIdiomaticity Representation 2 (AdMIRe 2),which challenges the community to assess andimprove models’ ability to interpret idiomaticexpressions in multimodal contexts across mul-tiple languages. Participants competed in animage ranking task in which, for each item,systems receive a context sentence containinga potentially idiomatic expression (PIE) andfive candidate images. Participating systemsare required to predict the sentence type (i.e.,idiomatic vs. literal) for the given context andrank the images by how well they depict the in-tended meaning in that context. Among the par-ticipating systems the most effective methodsinclude pipelines utilizing closed-source com-mercial models such as Gemini 2.5 and GPT-5, and employing chain-of-thought reasoningstrategies. Methods to mitigate language mod-els’ bias towards literal interpretations and en-sembles to smooth out variance were common.