D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, Jie Zhou


Abstract
The many-to-many multimodal summarization (M3S) task aims to generate summaries in any language with document inputs in any language and the corresponding image sequence, which essentially comprises multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS, little research pays attention to the M3S task. Besides, existing studies mainly focus on 1) utilizing MMS to enhance MXLS via knowledge distillation without considering the performance of MMS or 2) improving MMS models by filtering summary-unrelated visual features with implicit learning or explicitly complex training objectives. In this paper, we first introduce a general and practical task, i.e., M3S. Further, we propose a dual knowledge distillation and target-oriented vision modeling framework for the M3S task. Specifically, the dual knowledge distillation method guarantees that the knowledge of MMS and MXLS can be transferred to each other and thus mutually promote both of them. To offer target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed to discard needless visual information. Extensive experiments on the many-to-many setting show the effectiveness of the proposed approach. Additionally, we contribute a many-to-many multimodal summarization (M3Sum) dataset with 44 languages to facilitate future research.
Anthology ID:
2023.findings-emnlp.994
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14910–14922
URL:
https://aclanthology.org/2023.findings-emnlp.994
DOI:
10.18653/v1/2023.findings-emnlp.994
Cite (ACL):
Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2023. D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14910–14922, Singapore. Association for Computational Linguistics.
Cite (Informal):
D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization (Liang et al., Findings 2023)
PDF:
https://preview.aclanthology.org/landing_page/2023.findings-emnlp.994.pdf