D²TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

Yunlong Liang; Fandong Meng; Jiaan Wang; Jinan Xu (徐金安); Yufeng Chen (陈钰枫); Jie Zhou (周洁)

doi:10.18653/v1/2023.findings-emnlp.994

Abstract

The many-to-many multimodal summarization (M³S) task aims to generate summaries in any language from documents in any language and their corresponding image sequences. It essentially comprises the multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS, little research has addressed the M³S task. Moreover, existing studies mainly focus on 1) using MMS to enhance MXLS via knowledge distillation without considering the performance of MMS, or 2) improving MMS models by filtering summary-unrelated visual features through implicit learning or explicit but complex training objectives. In this paper, we first introduce a general and practical task, i.e., M³S. We then propose a dual knowledge distillation and target-oriented vision modeling framework for the M³S task. Specifically, the dual knowledge distillation method ensures that the knowledge of MMS and MXLS can be transferred to each other, so that the two tasks mutually promote one another. To provide target-oriented visual features, we design a simple yet effective target-oriented contrastive objective that discards needless visual information. Extensive experiments in the many-to-many setting show the effectiveness of the proposed approach. Additionally, we contribute a many-to-many multimodal summarization (M³Sum) dataset with 44 languages to facilitate future research.

Anthology ID: 2023.findings-emnlp.994
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14910–14922
URL: https://preview.aclanthology.org/landing_page/2023.findings-emnlp.994/
DOI: 10.18653/v1/2023.findings-emnlp.994
Cite (ACL): Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, and Jie Zhou. 2023. D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14910–14922, Singapore. Association for Computational Linguistics.
Cite (Informal): D2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization (Liang et al., Findings 2023)
PDF: https://preview.aclanthology.org/landing_page/2023.findings-emnlp.994.pdf
