Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

Abid Ali, Diego Molla, Usman Naseem


Abstract
Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Process (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
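The abstract says the Visual Relevance Predictor is trained on soft labels distilled from a DPP teacher, but it does not say how those labels are computed. As a minimal sketch of one plausible reading, the per-image marginal inclusion probabilities of an L-ensemble DPP can serve as soft targets. This uses the standard identity K = L(L + I)^{-1} = I - (L + I)^{-1}, whose diagonal gives each item's marginal probability. The kernel construction (quality score times cosine similarity) and the names `l_kernel`, `invert`, and `dpp_soft_labels` are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l_kernel(features, quality):
    """Assumed L-ensemble kernel: L_ij = q_i * cos(f_i, f_j) * q_j."""
    n = len(features)
    return [
        [quality[i] * cosine(features[i], features[j]) * quality[j] for j in range(n)]
        for i in range(n)
    ]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dpp_soft_labels(features, quality):
    """Marginal inclusion probabilities: diag(I - (L + I)^{-1})."""
    L = l_kernel(features, quality)
    n = len(L)
    L_plus_I = [
        [L[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    inv = invert(L_plus_I)
    return [1.0 - inv[i][i] for i in range(n)]


# Two dissimilar images keep higher marginals than two identical ones.
distinct = dpp_soft_labels([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
duplicates = dpp_soft_labels([[1.0, 0.0], [1.0, 0.0]], [1.0, 1.0])
```

Under these assumptions, redundant images receive lower soft labels than distinct ones of equal quality, which is the salience-plus-diversity behavior the abstract attributes to the DPP teacher. A student predictor could then be fit to these targets with a standard distillation loss.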
Anthology ID:
2026.findings-acl.1451
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
29032–29047
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1451/
Cite (ACL):
Abid Ali, Diego Molla, and Usman Naseem. 2026. Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29032–29047, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention (Ali et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1451.pdf
Checklist:
 2026.findings-acl.1451.checklist.pdf