Mechanistic Interpretability of Text-to-Image Diffusion Models via Cross-Attention Interventions

Maisha Maliha; Dean F. Hougen

Abstract

Text-to-image diffusion models achieve remarkable generation quality, yet their internal mechanisms for grounding prompt semantics into visual structure remain poorly understood. We present a novel mechanistic interpretability framework for Stable Diffusion that probes how individual prompt tokens are represented and utilized during the denoising process. Given a prompt, we record cross-attention activations throughout UNet denoising and convert them into token-level spatial grounding maps that indicate where each token contributes signal during image synthesis. To establish causal faithfulness, we perform controlled prompt interventions by removing a single word at a time while keeping the sampling seed fixed, producing counterfactual generations. To quantify mechanistic sensitivity, we introduce a head-resolved spike score based on divergence between per-head token contribution distributions before and after intervention, enabling module-wise and head-wise attribution of semantic changes. Experiments on compositional prompts and challenging relational descriptions reveal systematic patterns of token grounding, semantic drift, and head specialization across denoising timesteps. Our results provide a practical and reproducible toolkit for analyzing how diffusion models encode and apply semantic information, supporting deeper transparency in text-to-image generation.
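The abstract does not spell out how the head-resolved spike score is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the divergence is Jensen-Shannon, that a head's "token contribution distribution" is its attention mass per token averaged over spatial positions, and that the removed token is dropped from the baseline distribution so both distributions share the same support. All function names, array shapes, and the random toy inputs are hypothetical.

```python
import numpy as np

def token_contributions(attn):
    """attn: (heads, pixels, tokens) cross-attention weights.
    Returns a (heads, tokens) array: per-head attention mass per token,
    averaged over spatial positions and normalized to sum to 1."""
    mass = attn.mean(axis=1)
    return mass / mass.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence along the last axis (natural log)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def head_spike_scores(attn_base, attn_ablated, removed_idx):
    """Assumed scoring rule: one divergence value per attention head.
    attn_base:    (heads, pixels, T)   weights for the full prompt
    attn_ablated: (heads, pixels, T-1) weights with one word removed"""
    p = token_contributions(attn_base)
    # Drop the removed token and renormalize so p and q are comparable.
    p = np.delete(p, removed_idx, axis=-1)
    p = p / p.sum(axis=-1, keepdims=True)
    q = token_contributions(attn_ablated)
    return js_divergence(p, q)

# Toy stand-in for recorded activations: 8 heads, 64 positions, 5 tokens.
rng = np.random.default_rng(0)
base = rng.random((8, 64, 5))
base = base / base.sum(axis=-1, keepdims=True)
ablated = rng.random((8, 64, 4))
ablated = ablated / ablated.sum(axis=-1, keepdims=True)

scores = head_spike_scores(base, ablated, removed_idx=2)
print("most sensitive head:", int(scores.argmax()))
```

In an actual pipeline the two arrays would come from cross-attention layers recorded during two denoising runs that share a sampling seed, one with the full prompt and one with a single word removed. The score could then be computed per layer and per timestep to obtain the module-wise and head-wise attribution the abstract describes.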

Anthology ID: 2026.findings-acl.1265
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 25287–25299
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1265/
Cite (ACL): Maisha Maliha and Dean F. Hougen. 2026. Mechanistic Interpretability of Text-to-Image Diffusion Models via Cross-Attention Interventions. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25287–25299, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Mechanistic Interpretability of Text-to-Image Diffusion Models via Cross-Attention Interventions (Maliha & Hougen, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1265.pdf
Checklist: 2026.findings-acl.1265.checklist.pdf
