Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang (Editors)
- Anthology ID:
- 2026.alvr-main
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Venues:
- ALVR | WS
- Events:
- Annual Meeting of the Association for Computational Linguistics (2026) | Workshop on Advances in Language and Vision Research (2026) | Other Workshops and Events (2026)
- SIG:
- Publisher:
- Association for Computational Linguistics
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main/
- DOI:
- ISBN:
- 979-8-89176-398-2
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.alvr-main.pdf
Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models
Ben Jenkins
Chain-of-thought (CoT) reasoning has become a standard technique for eliciting complex reasoning in large language models, and recent work has extended it to vision-language models (VLMs). However, virtually all multimodal CoT methods generate intermediate reasoning steps in natural language, even for inherently visual problems such as spatial reasoning, geometric manipulation, and object tracking. We ask a fundamental question: when should a VLM reason in words, and when should it reason in pictures? We present VisCoT-Diag, a diagnostic benchmark of 1,200 instances across five visual reasoning categories, and compare four CoT paradigms across four VLMs. Our results reveal a striking modality gap: textual CoT degrades performance by up to 17.5% on spatial transformation and 13.2% on multi-object tracking, while visual CoT yields gains of up to 23.1%. We identify three failure modes (spatial state collapse, transformation hallucination, tracking loss) and show that adaptive modality routing achieves 73.1% accuracy versus 68.9% for V-CoT-everywhere. We recommend practitioners use visual CoT for spatial tasks and textual CoT for compositional counting.
The rapid evolution of text-to-image generation has blurred the perceptual boundary between natural and synthetic imagery. However, it remains questionable whether the statistical structure of generated visual content mirrors the information density of the physical visual world. Drawing upon principles from statistical linguistics, this study investigates the visual language of generative models through the lens of Zipfian dynamics. By analyzing a large-scale corpus of real and synthetic images, we uncover a fundamental divergence between visual syntax and semantics. We find that while generative models have successfully replicated the low-level physics of light, their high-level texture vocabulary exhibits distinct statistical signatures. Our analysis reveals a spectrum of entropy, identifying architectural fingerprints unique to each model. Furthermore, we investigate the relationship between generated images and prompt complexity, and find that increasing the semantic specificity of text prompts systematically degrades the statistical realism of the generated output.
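As a rough illustration of what a Zipfian analysis measures, the sketch below fits the slope of log-frequency against log-rank for a sequence of discrete tokens. How the study turns image content into "visual words" is not stated in the abstract, so the `tokens` input and the function name `zipf_exponent` are assumptions made for this example only.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate the Zipf exponent s in frequency ~ rank**(-s) by least squares in log-log space."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; report its magnitude.
    return -cov / var


# Hypothetical token stream whose counts are 12, 6, 4, 3 (that is, 12 / rank).
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
toy_exponent = zipf_exponent(toy_tokens)
```

Comparing the exponent fitted on real images with the one fitted on generated images is one simple way to look for the kind of statistical signature the abstract describes.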
Semantically Aware Optimal Transport for Dense Label Transfer
Preeti | Kiran Ravish | Ankita Kushwaha | Pawan Kumar
Vision foundation models produce features that generalize across visual domains without fine-tuning, yet naively transferring labels through these feature spaces fails under large distribution shifts. We propose SAOT (**S**emantically **A**ware **O**ptimal **T**ransport), which learns a transport cost within a fused unbalanced optimal transport formulation for dense label transfer from frozen vision transformer features to new domains. SAOT combines a learnable appearance metric with semantic class-prototype priors, unbalanced transport for partial matching under distribution shift, and a block-sparse solver for tractable inference. We pair this with a two-stage decoder: an MLP trained on SAOT pseudo-labels, then refined via EMA-teacher self-training with class-balanced sampling. On GTA5→Cityscapes with frozen DINOv2 ViT-L/14 features, SAOT+Decoder reaches 25.7% mIoU, a **3.8×** improvement over nearest-neighbor transfer (6.7%), without any backbone adaptation. Per-class results show large gains on spatially coherent classes (road 90.3%, car 76.2%, building 71.5%), demonstrating that learned semantic transport costs capture domain-invariant structure even under severe synthetic-to-real shifts. On VOC train→val with frozen ViT-B/16 features, the full pipeline reaches 47.5% mIoU, indicating that the approach extends beyond synthetic-to-real adaptation.
CoSMoEs: Compact Sparse Mixture of Experts
Patrick Huber | Akshat Shrivastava | Ernie Chang | Chinnadhurai Sankar | Ahmed A Aly | Adithya Sagar
Sparse Mixture of Experts (MoE) models are widely used foundation architectures at large scale, yet remain under-explored at smaller sizes. In this work, we introduce Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference, addressing three key challenges: Quality, Memory, and Latency. On the quality front, we conduct a fair evaluation (removing confounding factors) and show that MoE architectures outperform dense models at on-device scale. We further propose weight-decomposed experts, which improve MoE performance beyond the standard formulation. On the memory and latency front, we address the prohibitively large parameter count of MoE models by improving expert offloading efficiency through a novel training-time loss, reducing inference latency for on-device deployment.
GraphicWeaver: Benchmarking Agentic Planning for Graphic Design Generation
Dayeon Ki | Tianyi Zhou | Marine Carpuat | Gang Wu | Puneet Mathur | Viswanathan Swaminathan
Vision-language model (VLM)-powered agents are increasingly enabling new forms of automation across various human tasks. While prior work has primarily focused on well-defined problems with explicit goals, the capabilities of agents in creative graphic design, where goals are inherently open-ended and subjective, remain largely underexplored. To bridge this gap, we introduce GraphicWeaver, a planning benchmark for graphic design comprising 1,079 diverse user queries and associated images spanning four design categories. Comprehensive experiments with six models reveal that current VLM-based agents struggle to handle such complex planning tasks, which require taking into account both explicit design constraints specified in queries and implicit commonsense design principles. We attribute these failures to challenges in (1) retrieving appropriate parameters for tool usage, (2) understanding spatial relationships across design components, and (3) coordinating dependencies across agents. We envision GraphicWeaver as a challenging yet valuable testbed for advancing VLM agent planning in creative design contexts.
Scaling Vision–Language Models for Pharmaceutical Long-Form Video Reasoning on Industrial GenAI Platform
Suyash Mishra | Qiang Li | Srikanth Patil | Satyanarayan Pati | Baddu Narendra
Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4 and M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3–8× efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
PGGA: A Plan-Grounded GUI Agent for Automated Device Support
Lei Hsiung | Zhiyu Chen | Seonhoon Kim | Qun Liu
Current GUI agents struggle with multi-step digital device support. We investigate whether this failure is partly caused by a procedural knowledge deficit: agents often rely on zero-shot visual exploration instead of executing verified instructions. To address this, we introduce the Plan-Grounded GUI Agent (PGGA), framing interface navigation as a knowledge-execution problem by conditioning low-level actions on step-by-step text plans. Evaluated on our focused Device-Support Interaction Benchmark (DSIB), results reveal a sharp gap between knowing which operation to perform and grounding that operation on the screen: GTA1-7B reaches 99.59% Operation Accuracy with expert plans, but only 82.99% Element Accuracy and 45.61% Task Success Rate; without plans, its Task Success Rate is 0.00%. Our fine-tuned 2B-parameter PGGA achieves 54.39% Task Success Rate and 91.28% Element Accuracy when guided by expert plans, suggesting that explicit procedural grounding can substantially improve GUI execution when high-quality plans are available. Project Page: https://hsiung.cc/PGGA/
CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring
Jiamin Su | Yibo Yan | Zhuoran Gao | Han Zhang | Xiang Liu | Huiyu Zhou | Xuming Hu
Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address the limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed and evidence-grounded feedback; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using widely adopted MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, with particularly strong gains in grammatical and lexical diversity. Our proposed CAFES paves the way for an intelligent multimodal AES system. The code and dataset are available at https://anonymous.4open.science/r/CAFES-C87F/.
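For readers unfamiliar with the metric, Quadratic Weighted Kappa compares two sets of integer scores and penalises disagreements by the squared distance between them. The sketch below is a plain standard-library implementation of the usual definition; the function name and the choice to return 1.0 when the expected disagreement is zero are assumptions of this example, not details taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two equal-length lists of integer scores."""
    n = len(rater_a)
    n_cat = max_score - min_score + 1
    if n == 0 or n != len(rater_b):
        raise ValueError("score lists must be non-empty and of equal length")
    if n_cat < 2:
        raise ValueError("need at least two score categories")
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (score_a, score_b)
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)
    if weighted_expected == 0.0:
        # Both raters gave one identical constant score; treated here as full agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


human = [1, 2, 3, 4]
agree = quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4)
reverse = quadratic_weighted_kappa(human, [4, 3, 2, 1], 1, 4)
```

Identical scores give a kappa of 1, and chance-level scoring gives a value near 0, which is why a relative improvement in QWK is read as better alignment with human graders.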
GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
Jianghangfan Zhang | Yibo Yan | Kening Zheng | Xin Zou | Song Dai | Xuming Hu
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the **Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator**. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset.
Look Where You’re Told: Instruction-Consistent Attention for GUI Grounding
Seonhoon Kim | Zhiyu Chen | Xin Li | Qun Liu
Visual grounding in graphical user interfaces (GUIs) requires accurate localization of UI elements from natural language instructions. Conventional coordinate generation approaches face inherent limitations, including sensitivity to resolution variations and lack of interpretability. Recently, coordinate-free attention-based methods have emerged as a promising alternative, but these methods supervise attention using only spatial location signals from ground-truth bounding boxes, without ensuring that the learned attention distributions reflect genuine semantic correspondence between the instruction and the attended visual regions. We propose Attention Cycle-Consistency (ACC), a self-supervised regularization framework that enforces bidirectional alignment between visual attention and instruction semantics. ACC introduces two complementary constraints: semantic consistency, which ensures attended visual regions contain sufficient information to reconstruct the original instruction, and spatial consistency, which requires attention distributions to remain invariant when cycled through instruction reconstruction. We further incorporate entropy regularization to encourage spatially concentrated attention. ACC is applicable as a lightweight, model-agnostic regularizer for attention-based coordinate-free grounding methods, adding zero computational overhead at inference as all auxiliary components are discarded after training.
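The entropy term mentioned above can be pictured with a small example: the Shannon entropy of an attention distribution is low when the weight sits on a few patches and high when it is spread out. The sketch below only computes that quantity; how ACC weights or schedules the term is not given in the abstract, and the function name is an assumption.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative attention weights after normalisation."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


focused = attention_entropy([1.0, 0.0, 0.0, 0.0])    # all mass on one patch
diffuse = attention_entropy([0.25, 0.25, 0.25, 0.25])  # mass spread uniformly
```

Adding such a term to a training loss pushes the model toward the `focused` end of this range.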
From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Alberto Gonzalo Rodriguez Salgado
How do multimodal models solve visual spatial tasks—through genuine planning, or by brute-forcing solutions in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images organized into nine controlled groups (diagnostic, grid scale, wall density, trap ablation, unreachable detection, and more), and evaluate 16 model configurations across four providers (OpenAI, Anthropic, Google, Alibaba) at multiple reasoning effort levels. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but our analysis reveals these scores are misleading: models translate images into text grids and brute-force paths via serial enumeration, consuming 1,710–22,818 tokens per solve for a task humans do in seconds. Without added reasoning budgets, all configurations score only 2–12%; on 20x20 ultra-hard mazes, they hit token limits and give up. Qualitative analysis of model outputs confirms a universal two-stage strategy: image-to-grid translation followed by step-by-step path search in natural language—essentially BFS implemented in prose. A text-grid ablation shows Claude’s poor image performance (6%) jumps to 80% when given the correct grid directly, confirming vision quality, not reasoning ability, as the bottleneck for weaker models. Perhaps most striking, when we explicitly instruct models not to build a text grid and not to perform graph search—asking them to "reason visually, like a human"—they silently ignore the instruction and immediately fall back to the same grid-enumeration strategy. This suggests that brute-force token-level search is the dominant mechanism these models rely on for spatial planning in our setting.
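To make the "BFS implemented in prose" description concrete, here is a minimal breadth-first search over a text grid of the kind the models reportedly reconstruct. The grid encoding (`S` start, `G` goal, `#` wall, `.` open cell) is an assumption of this sketch and is not claimed to match MazeBench's format.

```python
from collections import deque


def shortest_path_length(grid):
    """Return the number of moves on a shortest S-to-G path, or None if G is unreachable."""
    start = goal = None
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == "S":
                start = (r, c)
            elif cell == "G":
                goal = (r, c)
    if start is None or goal is None:
        raise ValueError("grid must contain both 'S' and 'G'")
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


toy_maze = [
    "S.#",
    ".##",
    "..G",
]
toy_length = shortest_path_length(toy_maze)
blocked_length = shortest_path_length(["S#G"])
```

The search itself takes a few lines once the grid is correct, which is consistent with the abstract's finding that image-to-grid translation, rather than path search, is the bottleneck for weaker models.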
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Philip Wootaek Shin | Ajay Narayanan Sridhar | Lakshmi Sivani Devarapalli | Rui Zhang | Jack Sampson | Vijaykrishnan Narayanan
Vision–language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman | Marufa Kamal | Fahad Rahman | Sunzida Siddique | Ahmed Rafi Hasan | Mohd Ariful Haque | Kishor Datta Gupta | Roy George
General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (VLCE), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. VLCE operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model, a CNN-LSTM or a hierarchical cross-modal Transformer, refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, VLCE with knowledge graph enrichment produces captions preferred over QwenVL baselines in 95.33% of image pairs on InfoMetIC and 73.64% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary.
Beyond Visual Similarity: Rule-Guided Multimodal Clustering with Explicit Domain Rules
Kishor Datta Gupta | Mohd Ariful Haque | Marufa Kamal | Ahmed Rafi Hasan | Md. Mahfuzur Rahman | Roy George
Traditional clustering techniques often rely solely on similarity in the input data, limiting their ability to capture structural or semantic constraints that are critical in many domains. We introduce the Domain-Aware Rule-Triggered Variational Autoencoder (DART-VAE), a rule-guided multimodal clustering framework that incorporates domain-specific constraints directly into the representation learning process. DART-VAE extends the VAE architecture by embedding explicit rules, semantic representations, and data-driven features into a unified latent space, while enforcing constraint compliance through rule-consistency and violation penalties in the loss function. Unlike conventional clustering methods that rely only on visual similarity or apply rules as post-hoc filters, DART-VAE treats rules as first-class learning signals. The rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties. Experiments on aircraft and automotive datasets demonstrate that rule-guided clustering produces more operationally meaningful and interpretable clusters—for example, isolating UAVs, unifying stealth aircraft, or separating SUVs from sedans—while improving traditional clustering metrics. However, the framework faces challenges: LLM-generated rules may hallucinate or conflict, excessive rules risk overfitting, and scaling to complex domains increases computational and consistency difficulties. By combining rule encodings with learned representations, DART-VAE achieves more meaningful and consistent clustering outcomes than purely data-driven models, highlighting the utility of constraint-guided multimodal clustering for complex, knowledge-intensive settings.
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
Formal Machine Interpretation for the Semasiographic Mixtec Codices of Precolonial and Early Colonial Mesoamerica
Christopher Driggers-Ellis | Gabriel Ayoubi | Girish.Salunke811@Gmail.Com Girish.Salunke811@Gmail.Com | Christan Grant
The precolonial and early colonial Mixtec codices describe the history and stories of the region in a semasiographic medium that is full of symbolic representations and meant to be narrated. Recently, the community has introduced datasets of XML representations of related media, including Aztec codices and Mayan hieroglyphic script, in a step towards symbolic machine interpretation of these historic Mesoamerican artifacts. In this work, we propose formal symbolic machine interpretation of XML encodings representing facsimile images from the Mixtec Codex Zouche-Nuttall. We demonstrate the efficacy of symbolic machine interpretation from XML step-by-step, showing how our parser and interpreter process text capturing a scene from the Mixtec Codex Zouche-Nuttall. We hope our contribution and the example we provide motivate collaboration among the archaeological, historical, linguistic, and natural language processing research communities to apply machine interpretation to Mixtec codices and similar manuscripts.
Temporal-Linguistic Adaptive Streaming for Continuous Sign Language Translation
Arshia Kermani | Habib Irani | Deautaun Ross | Vangelis Metsis
Real-time sign language translation must generate text incrementally as signs arrive, yet existing streaming policies treat glosses as a flat token sequence and discard the temporal rhythm of signing. Inter-gloss pauses reliably mark sentence boundaries in continuous discourse, but policies such as Wait-k cause arbitrary cross-boundary fragmentation. We propose Temporal-Linguistic Adaptive Streaming (TLAS), which fuses a Temporal Pause Detector (TPD, tracking inter-gloss interval statistics via an exponential moving average) and a Linguistic Readiness Estimator (LRE, a trained neural head on a frozen T5 encoder) through an Adaptive Fusion Gate (AFG). A proactive timeout fires before the next gloss arrives when the inter-gloss gap exceeds a threshold, producing clean sentence segmentation without oracle boundary information. We also contribute a synthetic discourse dataset of 1,400 ASL discourse groups with LLM-generated per-gloss timestamps and introduce a continuous-stream evaluation paradigm requiring autonomous boundary detection from an unbroken gloss stream. Under such conditions, TLAS significantly outperforms current heuristic baselines, such as Wait-k, and methods relying solely on linguistic content.
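The pause-tracking idea can be sketched as follows: keep an exponential moving average (EMA) of the gaps between gloss arrivals and flag a boundary when a gap is much longer than that average. This is only an illustration of the general mechanism. The class name, the smoothing factor `alpha`, the relative threshold `ratio`, and the decision not to fold boundary gaps into the average are all assumptions of this example, and the learned readiness estimator and fusion gate are left out.

```python
class PauseDetector:
    """Toy inter-gloss pause detector based on an EMA of arrival gaps."""

    def __init__(self, alpha=0.3, ratio=2.0):
        self.alpha = alpha      # EMA smoothing factor (hypothetical value)
        self.ratio = ratio      # a gap longer than ratio * EMA counts as a pause
        self.ema = None         # running estimate of the typical gap
        self.last_time = None   # arrival time of the most recent gloss

    def observe(self, timestamp):
        """Register a gloss arrival; return True if the gap before it looks like a boundary."""
        if self.last_time is None:
            self.last_time = timestamp
            return False
        gap = timestamp - self.last_time
        self.last_time = timestamp
        if self.ema is None:
            self.ema = gap
            return False
        is_boundary = gap > self.ratio * self.ema
        if not is_boundary:
            # Only ordinary gaps update the average, so long pauses do not inflate it.
            self.ema = self.alpha * gap + (1 - self.alpha) * self.ema
        return is_boundary

    def timed_out(self, now):
        """Proactive check: has the silence since the last gloss already exceeded the threshold?"""
        if self.ema is None or self.last_time is None:
            return False
        return (now - self.last_time) > self.ratio * self.ema


detector = PauseDetector()
arrivals = [0.0, 0.5, 1.0, 1.5, 3.5, 4.0]
flags = [detector.observe(t) for t in arrivals]
```

The `timed_out` method mirrors the proactive timeout described above: it can be polled between arrivals, so a sentence can be closed before the next gloss shows up.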
Multimodal Large Language Models (MLLMs) have achieved remarkable success in semantic visual reasoning, yet their capacity for fine-grained, low-level perception remains critically under-evaluated. This perceptual fragility limits their reliability in noisy, real-world environments where visual signals are degraded. Furthermore, existing benchmarks often entangle visual perception with language priors, masking these underlying deficits. To address this, we introduce the **FAint numeric Detection Evaluation (FADE)** dataset, a novel evaluation suite designed to probe the limits of zero-shot Optical Character Recognition (OCR) in frontier MLLMs. By embedding synthetic, strictly numerical sequences over cluttered natural backgrounds at varying levels of transparency (𝛼), FADE explicitly disentangles pure visual perception from semantic predictability. We evaluate state-of-the-art models including Gemini 3.0, Claude 4.5 Sonnet, and Gemma 3 against a specialized UNet segmentation baseline. Our results reveal a striking limitation in frontier architectures: while they achieve near-perfect transcription at high visibility, their performance collapses under high transparency. Conversely, the UNet pipeline maintains robust spatial grounding, significantly outperforming generalist models at the lowest visibility thresholds. FADE provides a reproducible dataset to expose and diagnose the perceptual breakage points of modern multimodal systems.
Visual Question Answering (VQA) models process all image patches uniformly despite questions typically requiring only a small subset of visual information. This inefficiency leads to unnecessary computation and can result in attention dilution across irrelevant image regions. We propose Question-Guided Sparse Attention (QGSA), a plug-and-play mechanism that dynamically selects relevant image patches conditioned on question semantics. Our approach introduces three components: (1) a differentiable patch selector based on Gumbel-Softmax reparameterisation that enables end-to-end training with hard patch selection at inference; (2) a self-supervised grounding loss that encourages spatial selectivity without bounding-box annotations, combining contrastive patch selection with patch–word alignment via a frozen CLIP encoder; and (3) an adaptive sparsity mechanism that adjusts the number of selected patches according to estimated question complexity. Experiments on SmolVLM-256M-Instruct and SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO) demonstrate that QGSA reduces cross-attention FLOPs by 91–99% across input resolutions, achieving up to 76× theoretical speedup at 576px resolution, while maintaining exact accuracy parity with the dense baseline (𝛥=0.0 pp on all datasets). Wall-clock parity with the dense baseline is reached at 336px; realised end-to-end speedup requires larger models where cross-attention dominates total compute. QGSA consistently selects an average of k≈17 patches out of 576 (256M model), up to k≈18 (500M model), yielding up to a 34× reduction in the visual token sequence. These small-scale results validate the feasibility of question-conditioned sparse attention and provide a foundation for scaling to larger VLMs.
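A generic picture of the two selection regimes named above: during training, Gumbel noise plus a temperature-scaled softmax gives a differentiable soft choice over patches, while at inference the top-scoring patches are kept outright. The sketch below shows the standard single-sample Gumbel-Softmax and a plain top-k pick. It is not the paper's selector; the function names, temperature, and scores are assumptions for illustration.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """One Gumbel-Softmax sample: a soft, noisy selection distribution over patches."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)           # keep u strictly positive before taking logs
        gumbel = -math.log(-math.log(u))       # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                          # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top_k(scores, k):
    """Inference-time selection: indices of the k highest-scoring patches, in index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


soft = gumbel_softmax([2.0, 0.5, -1.0, 0.0], tau=0.5, rng=random.Random(0))
kept = hard_top_k([0.1, 0.9, 0.3, 0.7], k=2)
reduction = 576 / 17   # patches before selection / average patches kept (figures from the abstract)
```

The last line reproduces the arithmetic behind the reported reduction: keeping about 17 of 576 patches shortens the visual token sequence by a factor of roughly 34.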
Systematic Performance Degradation in Indic Vision-Language Models: Evidence from Hindi and Telugu
Rishikant Chigrupaatii | Ponnada Sai Tulasi Kanishka | Lalit Chandra Routhu | Martin Patel | Sama Supratheek Reddy | Divyam Gupta | Rajiv Misra | Rohun Tripathi
With 1.5 billion people speaking over 120 major languages, India exemplifies the challenges of multilingual AI evaluation. Current multilingual VLM benchmarks suffer from unverified auto-translations, narrow task coverage, small sample sizes, and lack of culturally grounded content. We present HinTel-AlignBench, a comprehensive evaluation framework and benchmark for Hindi and Telugu vision-language models with English-aligned samples. Our framework combines semi-automated translation with human verification to generate 4k QA pairs per language across five domains: adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native Indic sets (JEE for STEM, VAANI for cultural grounding). Evaluation of state-of-the-art open and closed-source VLMs reveals consistent performance regression from English to Indic languages, with average drops of 8.3 points for Hindi and 5.5 points for Telugu across four of five tasks. We identify key failure modes and establish reproducible baselines for multilingual multimodal evaluation.
How Fragile Is Vision-Language Alignment? Mapping Concept Disruption Under Text-to-Image Personalization
Mujtaba Hasan
Text-to-image diffusion models learn a mapping from natural language to visual structure, but how robust is this mapping to perturbation? We use personalization—fine-tuning a model to learn a new face, object, or style—as a controlled stress test to probe the fragility of learned vision-language alignment. We find that fine-tuning for one concept systematically shifts the model’s ability to faithfully render unrelated concepts, and that this disruption follows structured, predictable patterns. To measure this fragility, we construct Concept Entanglement Maps: per-prompt, per-model disruption matrices that reveal which concepts are most affected and why. Using Stable Diffusion v1.5 as a controlled testbed, we evaluate 15 subjects across three personalization methods on 200 prompts and report three findings about the organization of vision-language alignment: (1) aggregate disruption is larger for vision-backbone and cross-attention perturbations than for text-embedding perturbations, despite the latter directly modifying the language representation; (2) abstract and compositional language is significantly more fragile than concrete, object-specific language; and (3) disruption does not follow semantic proximity—personalizing for a face does not preferentially disrupt other face-related prompts (p = 1.0), suggesting that alignment vulnerability is organized globally rather than purely by semantic category. These findings expose a structural vulnerability in current text-to-image personalization: the same cross-attention mechanism that enables compositional generalization also creates pathways through which local fine-tuning can propagate as global alignment shift.
The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It
Kaustubh S. Bukkapatnam
Large vision-language models (LVLMs) achieve strong performance on many multimodal tasks, yet consistently fail at compositional relational reasoning—distinguishing "the cat on the mat" from "the mat on the cat." We provide a formal explanation for this failure. We prove that any vision-language alignment operating on pooled (order-invariant) visual features contains compositional blind spots: semantically distinct scenes that map to identical representations. We show that the number of blind spots grows factorially with scene complexity, establishing a fundamental limit on pooled-feature architectures. Motivated by this analysis, we propose REGROUND, a training-free, test-time method that re-introduces spatial structure into alignment by performing relation-guided cross-attention over spatial visual tokens, directed by a lightweight parse of the text query. Without any fine-tuning, REGROUND improves compositional accuracy by +8.6 points on Winoground, +8.4 on ARO-Relation, +6.4 on SugarCrepe, and +8.4 on VSR when applied to LLaVA-1.5, and provides consistent gains across other LVLMs. Ablation studies confirm that each component—parse guidance, token-level attention, and relation masking—contributes significantly.
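The order-invariance argument in this abstract can be illustrated with a small sketch. It is not the paper's proof or the REGROUND method; it only shows the intuition that mean pooling discards arrangement, so every ordering of the same region features collapses to one pooled vector. The `scene` features and the `mean_pool` helper are hypothetical, chosen for illustration.

```python
from itertools import permutations
from math import factorial


def mean_pool(tokens):
    """Average a list of equal-length feature vectors, ignoring their order."""
    n = len(tokens)
    return tuple(sum(column) / n for column in zip(*tokens))


# Hypothetical region features for three objects (integers keep the sums exact).
scene = [(1, 0, 2), (0, 3, 1), (4, 1, 0)]

# Pool every possible arrangement of the same three region features.
pooled = {mean_pool(list(arrangement)) for arrangement in permutations(scene)}

# Number of distinct arrangements that were pooled.
num_arrangements = factorial(len(scene))

print(f"{num_arrangements} arrangements -> {len(pooled)} pooled representation(s)")
```

Under these assumptions, all arrangements of the three tokens yield a single pooled vector, which is the kind of collision the abstract calls a compositional blind spot. The count of arrangements grows as n! with the number of tokens.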
HalluTrace: Causal Attribution and Source-Targeted Decoding for Hallucination in Large Vision-Language Models
Kaustubh S. Bukkapatnam
Object hallucination in large vision-language models (LVLMs) is well-documented, but the mechanisms that produce it remain poorly understood. We introduce HALLUTRACE, a causal attribution framework that decomposes hallucination into three distinct sources: (VGF) visual grounding failure, where the visual encoder produces a representation insufficient to identify the target object; (LPD) language prior dominance, where the language model overrides a correct visual signal with a statistically-driven prediction; and (CMC) cross-modal conflict, where visual and linguistic signals are irreconcilably inconsistent and the model resolves the conflict incorrectly. We operationalise these sources via causal component ablations: intervening on f_vis, f_proj, and f_LM independently and measuring the change in CHAIR score. Experiments on five LVLMs show that attribution patterns are object-category-specific and model-consistent: person/vehicle hallucinations are predominantly LPD (≥52%), food/furniture hallucinations are predominantly VGF (≥44%), and animal hallucinations split between VGF and CMC. Guided by these attributions, we design HAD (Hallucination-Aware Decoding), a unified decoding strategy that applies source-targeted interventions: visual signal amplification for VGF, language prior suppression for LPD, and contrastive re-weighting for CMC. HAD reduces CHAIR_I by 3.7–5.6 points and improves POPE F1 by 1.9–3.1 points over LLaVA-1.5, outperforming VCD and ICD on all three benchmarks (CHAIR, POPE, MME) without any additional training. We further prove that the attribution-decoding correspondence is tight: the CHAIR improvement from HAD is linearly predictable from the VGF attribution share (r = 0.86, p < 10^-6), validating the causal framework.
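The abstract reports results in CHAIR points without defining the metric. A minimal sketch follows, assuming the commonly used definition: the instance-level score is the fraction of mentioned objects that are absent from the image, and the sentence-level score is the fraction of captions containing at least one such object. The function name `chair_scores` and the toy data are hypothetical, and the sketch assumes object mentions have already been extracted from each caption.

```python
def chair_scores(mentioned_per_caption, ground_truth_per_image):
    """Return (instance-level, sentence-level) hallucination rates.

    mentioned_per_caption: list of lists of object names found in each caption.
    ground_truth_per_image: list of sets of objects actually present in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentioned, present in zip(mentioned_per_caption, ground_truth_per_image):
        # An object is hallucinated if the caption mentions it but the image lacks it.
        hallucinated = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += 1 if hallucinated else 0

    num_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / num_captions if num_captions else 0.0
    return chair_i, chair_s


# Toy example (illustrative data, not from the paper).
captions = [["person", "dog", "frisbee"], ["car", "person"]]
ground_truth = [{"person", "dog"}, {"car", "person"}]

chair_i, chair_s = chair_scores(captions, ground_truth)
print(f"instance-level: {chair_i:.2f}, sentence-level: {chair_s:.2f}")
```

In the toy example, one of five mentioned objects ("frisbee") is absent from its image, and one of two captions contains a hallucinated object.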