Mechanistic Interpretability of Text-to-Image Diffusion Models via Cross-Attention Interventions

Maisha Maliha, Dean F. Hougen


Abstract
Text-to-image diffusion models achieve remarkable generation quality, yet their internal mechanisms for grounding prompt semantics into visual structure remain poorly understood. We present a novel mechanistic interpretability framework for Stable Diffusion that probes how individual prompt tokens are represented and utilized during the denoising process. Given a prompt, we record cross-attention activations throughout UNet denoising and convert them into token-level spatial grounding maps that indicate where each token contributes signal during image synthesis. To establish causal faithfulness, we perform controlled prompt interventions by removing a single word at a time while keeping the sampling seed fixed, producing counterfactual generations. To quantify mechanistic sensitivity, we introduce a head-resolved spike score based on divergence between per-head token contribution distributions before and after intervention, enabling module-wise and head-wise attribution of semantic changes. Experiments on compositional prompts and challenging relational descriptions reveal systematic patterns of token grounding, semantic drift, and head specialization across denoising timesteps. Our results provide a practical and reproducible toolkit for analyzing how diffusion models encode and apply semantic information, supporting deeper transparency in text-to-image generation.
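The intervention and scoring steps described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. The abstract does not name the divergence, so Jensen-Shannon divergence is assumed here. The function names (`leave_one_out_prompts`, `head_spike_scores`) are hypothetical, each whitespace-separated word is treated as one token, and the removed token's mass is dropped and the rest renormalized so both distributions cover the same tokens. The attention values in the usage note are toy numbers, not Stable Diffusion outputs.

```python
import math


def leave_one_out_prompts(prompt):
    """Return (removed_word, counterfactual_prompt) pairs, dropping one word at a time."""
    words = prompt.split()
    return [(w, " ".join(words[:i] + words[i + 1:])) for i, w in enumerate(words)]


def normalize(weights, eps=1e-12):
    """Turn non-negative per-token attention mass into a probability distribution."""
    total = sum(weights) + eps * len(weights)
    return [(w + eps) / total for w in weights]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two equal-length distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def head_spike_scores(base, ablated, removed_index):
    """Per-head divergence between token distributions before and after an ablation.

    base:    one list per head with per-token attention mass for the full prompt.
    ablated: the same for the counterfactual prompt (one token fewer).
    Assumption: the removed token's entry is dropped from `base` and the
    remainder renormalized, so both distributions share the same support.
    """
    scores = []
    for head_base, head_abl in zip(base, ablated):
        kept = head_base[:removed_index] + head_base[removed_index + 1:]
        scores.append(js_divergence(normalize(kept), normalize(head_abl)))
    return scores
```

For example, `leave_one_out_prompts("a red cube")` yields one counterfactual prompt per word, and `head_spike_scores([[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]], [[0.4, 0.6], [0.9, 0.1]], removed_index=1)` returns one score per head. A head whose remaining token distribution is unchanged by the removal scores near zero, while a head whose attention shifts sharply scores higher.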
Anthology ID:
2026.findings-acl.1265
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
25287–25299
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1265/
Cite (ACL):
Maisha Maliha and Dean F. Hougen. 2026. Mechanistic Interpretability of Text-to-Image Diffusion Models via Cross-Attention Interventions. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25287–25299, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Mechanistic Interpretability of Text-to-Image Diffusion Models via Cross-Attention Interventions (Maliha & Hougen, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1265.pdf
Checklist:
 2026.findings-acl.1265.checklist.pdf