Diffusion-CAM: Faithful Visual Explanations for dMLLMs

Haomin Zuo, Yidi Li, Luoxiao Yang, Xiaofeng Zhang


Abstract
While diffusion Multimodal Large Language Models (dMLLMs) have recently made remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models, which produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are designed for local, sequential dependencies, are ill-suited to interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, thereby capturing both latent features and their class-specific gradients. To address the inherent stochasticity of these raw signals, we incorporate four key modules that resolve spatial ambiguity and mitigate intra-image confounders and redundant token correlations. Extensive experiments demonstrate that Diffusion-CAM significantly outperforms state-of-the-art (SoTA) methods in both localization accuracy and visual fidelity, establishing a new standard for understanding the parallel generation process of diffusion multimodal systems.
Anthology ID:
2026.acl-long.553
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
12085–12101
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.553/
Cite (ACL):
Haomin Zuo, Yidi Li, Luoxiao Yang, and Xiaofeng Zhang. 2026. Diffusion-CAM: Faithful Visual Explanations for dMLLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12085–12101, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Diffusion-CAM: Faithful Visual Explanations for dMLLMs (Zuo et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.553.pdf
Checklist:
2026.acl-long.553.checklist.pdf