Ye Xiong


2026

Multi-modal summarization (MMS) has emerged as a critical research area driven by the proliferation of multimedia content, focusing on generating condensed summaries by synthesizing complementary information across modalities. Previous studies have demonstrated the effectiveness of heterogeneous fusion paradigms, particularly visual-centric feature extraction mechanisms, in constructing cross-modal representations that yield substantial performance gains. Nevertheless, the comprehensive utilization of multi-modal information, along with the intricate interdependencies among textual content, visual elements, and the summary generation process, remains insufficiently explored. To address this under-exploitation of visual information, we propose the Patch-Refined Visual Information Network (PRVIN), whose patch selector and patch refiner components work collaboratively to progressively identify and refine critical visual features. An additional vision-to-summary alignment mechanism further strengthens the semantic connections between multi-modal representations and summary outputs. Extensive experiments on two public MMS benchmark datasets demonstrate the superiority of PRVIN and quantitatively validate the crucial role of comprehensive visual information utilization in MMS.
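The abstract names three components (patch selector, patch refiner, vision-to-summary alignment) but does not describe how they are implemented. The following is a minimal PyTorch sketch of one plausible reading, in which the selector keeps the top-k most salient patches, the refiner attends over text features, and the alignment is realized as an InfoNCE-style contrastive loss. All module names, dimensions, and design choices here are illustrative assumptions, not the authors' actual method.

```python
# A minimal sketch of the patch-selection / refinement / alignment pipeline
# described in the abstract. Every design choice below (linear saliency
# scoring, top-k selection, cross-attention refinement, contrastive
# alignment) is an assumption for illustration only.
import torch
import torch.nn as nn


class PatchSelector(nn.Module):
    """Keeps the top-k most salient visual patches (assumed design)."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-patch saliency score
        self.k = k

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim)
        scores = self.score(patches).squeeze(-1)    # (batch, num_patches)
        topk = scores.topk(self.k, dim=-1).indices  # salient patch indices
        idx = topk.unsqueeze(-1).expand(-1, -1, patches.size(-1))
        return patches.gather(1, idx)               # (batch, k, dim)


class PatchRefiner(nn.Module):
    """Refines selected patches by attending to text tokens (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # patches: (batch, k, dim); text: (batch, seq_len, dim)
        refined, _ = self.attn(query=patches, key=text, value=text)
        return self.norm(patches + refined)  # residual connection + norm


def vision_summary_alignment_loss(vis: torch.Tensor, summ: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style alignment of pooled visual and summary embeddings
    (an assumed formulation; the abstract only names the mechanism)."""
    vis = nn.functional.normalize(vis.mean(dim=1), dim=-1)    # (batch, dim)
    summ = nn.functional.normalize(summ.mean(dim=1), dim=-1)  # (batch, dim)
    logits = vis @ summ.t() / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(vis.size(0), device=vis.device)  # matched pairs
    return nn.functional.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Example: select 16 of 196 ViT-style patches, refine against 64 text
    # tokens, then align with a 32-token summary representation.
    selector, refiner = PatchSelector(dim=512, k=16), PatchRefiner(dim=512)
    patches = torch.randn(2, 196, 512)
    text = torch.randn(2, 64, 512)
    summary = torch.randn(2, 32, 512)
    refined = refiner(selector(patches), text)  # (2, 16, 512)
    loss = vision_summary_alignment_loss(refined, summary)
    print(refined.shape, loss.item())
```

Under these assumptions, the selector and refiner form the "progressive identify-and-refine" stage, while the contrastive loss is one common way to tie visual representations to summary outputs during training.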