Ritambhara Singh
2026
UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages.
Tassallah Abdullahi | Macton Mgonzo | Mardiyyah Oduwole | Paul Okewunmi | Abraham Toluwase Owodunni | Ritambhara Singh | Carsten Eickhoff
Findings of the Association for Computational Linguistics: ACL 2026
Current guardian models are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Achieving robust safety requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first policy-based safety benchmark for African languages built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate 15 models, comprising seven general-purpose LLMs and eight guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages.
Mechanisms of Prompt-Induced Hallucination in Vision–Language Models
William Rudman | Michal Golovanevsky | Dana Arad | Yonatan Belinkov | Carsten Eickhoff | Ritambhara Singh | Kyle Mahowald
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large vision–language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that ablating PIH-heads increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
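As a rough illustration of what ablating attention heads means, the toy sketch below zeroes the contribution of selected heads before the per-head outputs are combined. It is not the authors' code: the abstract does not say whether ablation is done by zeroing or by another method, so zero-ablation is an assumption here, and the names `ablate_heads` and `combine_heads` and the toy numbers are hypothetical.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Return a copy of the per-head outputs with the selected heads zeroed.

    Assumption: zero-ablation. Other schemes (e.g. replacing a head's
    output with its mean) are equally plausible readings of "ablation".
    """
    return [
        [0.0] * len(out) if idx in heads_to_ablate else list(out)
        for idx, out in enumerate(head_outputs)
    ]


def combine_heads(head_outputs):
    """Sum per-head contributions into a single update vector."""
    return [sum(values) for values in zip(*head_outputs)]


# Toy example: three heads, each writing a 2-dimensional contribution.
head_outputs = [
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 2.0],  # hypothetical head to be ablated
]

baseline = combine_heads(head_outputs)
ablated = combine_heads(ablate_heads(head_outputs, heads_to_ablate={2}))
```

In this toy setup, only the component written by the ablated head changes, which is the property that lets an ablation study attribute a behavior to specific heads.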
2025
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky | William Rudman | Vedant Palit | Carsten Eickhoff | Ritambhara Singh
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Vision-Language Models (VLMs) have gained prominence due to their success in solving complex cross-modal tasks. However, the internal mechanisms of VLMs, particularly the roles of cross-attention and self-attention in multimodal integration, are not fully understood. To address this gap, we introduce NOTICE, a Gaussian-Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in VLMs. NOTICE introduces Semantic Image Pairs (SIP) corruption, the first visual counterpart to Symmetric Token Replacement (STR) for text. Through NOTICE, we uncover a set of “universal attention heads” in BLIP and LLaVA that consistently contribute across different tasks and modalities. In BLIP, cross-attention heads implement object detection, object suppression, and outlier suppression, whereas important self-attention heads in LLaVA only perform outlier suppression. Notably, our findings reveal that cross-attention heads perform image-grounding, while self-attention heads in LLaVA do not, highlighting key differences in how VLM architectures handle multimodal learning.
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
William Rudman | Michal Golovanevsky | Amir Bar | Vedant Palit | Yann LeCun | Carsten Eickhoff | Ritambhara Singh
Findings of the Association for Computational Linguistics: ACL 2025
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of “sides” nor effectively processed visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o’s accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning.
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
Michal Golovanevsky | William Rudman | Michael A. Lepori | Amir Bar | Ritambhara Singh | Carsten Eickhoff
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually-realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge or visual input through activation-level interventions. On average, PvP successfully shifts 99.3% of color and 80.8% of size predictions from priors to counterfactuals. Together, these findings offer new tools for interpreting and controlling factual behavior in multimodal models.
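The general idea of an activation-level steering vector can be sketched as follows. The abstract does not specify how PvP vectors are computed, so this sketch assumes a common construction (the difference between mean activations under two conditions, added to a hidden state with a scale factor); the function names and toy activations are hypothetical and are not taken from the paper.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def steering_vector(target_acts, source_acts):
    """Direction from the source-condition mean to the target-condition mean.

    Assumption: a mean-difference construction, chosen for illustration only.
    """
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def apply_steering(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations: "prior" condition vs. "visual" condition (hypothetical).
prior_acts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
visual_acts = [[2.0, 1.0, 1.0], [4.0, 1.0, 1.0]]

to_visual = steering_vector(visual_acts, prior_acts)
steered = apply_steering(prior_acts[0], to_visual, alpha=1.0)
```

Swapping the argument order of `steering_vector` gives the opposite direction, which mirrors the abstract's point that outputs can be steered toward either world knowledge or visual input.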