Knowledge Transfer with Visual Prompt in multi-modal Dialogue Understanding and Generation

Minjun Zhu, Yixuan Weng, Bin Li, Shizhu He, Kang Liu, Jun Zhao


Abstract
The Visual Dialogue (VD) task has recently received increasing attention in AI research. Visual Dialogue aims to generate multi-round, interactive responses based on the dialogue history and image content. Existing text-only dialogue models cannot fully understand visual information, and therefore lack scene features when communicating with humans over multiple turns; how to efficiently fuse multi-modal data features thus remains a challenge. In this work, we propose a knowledge transfer method with visual prompts (VPTG) for fusing multi-modal data, a flexible module that enables a text-only sequence-to-sequence (seq2seq) model to handle visual dialogue tasks. VPTG performs text-image co-learning and multi-modal information fusion through visual prompts and visual knowledge distillation. Specifically, we construct visual prompts from visual representations and use them to induce the seq2seq model to fuse visual information with textual context via visual-text patterns. We also realize visual knowledge transfer through distillation between the text representations of two different models, so that the seq2seq model actively learns visual semantic representations. Extensive experiments on the multi-modal dialogue understanding and generation (MDUG) datasets show that the proposed VPTG outperforms other single-modal methods, demonstrating the effectiveness of visual prompts and visual knowledge transfer.
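The abstract outlines two mechanisms: projecting visual features into prompt embeddings that are prepended to the seq2seq model's input, and distilling a multi-modal teacher's representations into the text-only model. The sketch below illustrates both ideas in PyTorch; the dimensions, module names, and the choice of MSE as the distillation objective are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two mechanisms described in the abstract (NOT the
# paper's code): (1) mapping pooled visual features to "visual prompt"
# embeddings prepended to the seq2seq input, and (2) a distillation loss
# aligning the text-only student's hidden states with a multi-modal
# teacher's. All dimensions and the MSE objective are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualPromptFusion(nn.Module):
    def __init__(self, visual_dim=2048, hidden_dim=768, prompt_len=8):
        super().__init__()
        # Project pooled visual features to a short sequence of prompts.
        self.proj = nn.Linear(visual_dim, prompt_len * hidden_dim)
        self.prompt_len = prompt_len
        self.hidden_dim = hidden_dim

    def forward(self, visual_feats, text_embeds):
        # visual_feats: (batch, visual_dim); text_embeds: (batch, seq, hidden)
        prompts = self.proj(visual_feats).view(
            -1, self.prompt_len, self.hidden_dim
        )
        # Prepend visual prompts so the seq2seq encoder attends to them
        # alongside the textual dialogue context.
        return torch.cat([prompts, text_embeds], dim=1)


def distillation_loss(student_hidden, teacher_hidden):
    # Pull the text-only student's representations toward the multi-modal
    # teacher's (teacher is frozen via detach; MSE is an assumed choice).
    return F.mse_loss(student_hidden, teacher_hidden.detach())


# Example usage with random tensors standing in for real features.
fusion = VisualPromptFusion()
visual = torch.randn(2, 2048)       # pooled image features
text = torch.randn(2, 20, 768)      # encoder input embeddings
fused = fusion(visual, text)        # (2, 28, 768): prompts + text
loss = distillation_loss(torch.randn(2, 20, 768), torch.randn(2, 20, 768))
```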
Anthology ID: 2022.tu-1.2
Volume: Proceedings of the First Workshop On Transcript Understanding
Month: Oct
Year: 2022
Address: Gyeongju, South Korea
Editors: Franck Dernoncourt, Thien Huu Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, Trung H. Bui, David Seunghyun Yoon
Venue: TU
Publisher: International Conference on Computational Linguistics
Pages: 8–19
URL: https://aclanthology.org/2022.tu-1.2
Cite (ACL): Minjun Zhu, Yixuan Weng, Bin Li, Shizhu He, Kang Liu, and Jun Zhao. 2022. Knowledge Transfer with Visual Prompt in multi-modal Dialogue Understanding and Generation. In Proceedings of the First Workshop On Transcript Understanding, pages 8–19, Gyeongju, South Korea. International Conference on Computational Linguistics.
Cite (Informal): Knowledge Transfer with Visual Prompt in multi-modal Dialogue Understanding and Generation (Zhu et al., TU 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-3/2022.tu-1.2.pdf