V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models

Qidong Wang, Junjie Hu, Ming Jiang


Abstract
Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights into multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLaVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.
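To make the attention-modulating idea concrete, the following is a minimal, illustrative sketch (not the authors' released code) of how selected attention heads in a HuggingFace-style decoder could be scaled at inference time via a forward pre-hook on the attention output projection. The layer path, head indices, head count, and scaling factor are hypothetical placeholders, not settings from the paper.

```python
import torch

def make_head_scaling_hook(head_ids, num_heads, scale):
    """Return a forward pre-hook that rescales the hidden slices of the
    selected attention heads right before the output projection (o_proj)."""
    def pre_hook(module, args):
        hidden = args[0]                      # (batch, seq_len, num_heads * head_dim)
        head_dim = hidden.shape[-1] // num_heads
        hidden = hidden.clone()
        for h in head_ids:
            # Scale the contiguous slice belonging to head h.
            hidden[..., h * head_dim:(h + 1) * head_dim] *= scale
        return (hidden,) + args[1:]
    return pre_hook

# Hypothetical usage: amplify two "positive" heads in one decoder layer
# of a loaded VLM's language model, then remove the hook after inference.
# layer = model.language_model.model.layers[20].self_attn.o_proj
# handle = layer.register_forward_pre_hook(
#     make_head_scaling_hook(head_ids=[3, 17], num_heads=32, scale=1.5)
# )
# ... run VQA inference ...
# handle.remove()
```

Dampening "negative" heads would use the same hook with a scale below 1; the paper's actual modulation procedure is described in the full text linked below.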
Anthology ID:
2025.emnlp-main.880
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
17407–17431
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.880/
Cite (ACL):
Qidong Wang, Junjie Hu, and Ming Jiang. 2025. V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17407–17431, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models (Wang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.880.pdf
Checklist:
 2025.emnlp-main.880.checklist.pdf