Distilled Dual-Encoder Model for Vision-Language Understanding
Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, Furu Wei
Abstract
On vision-language understanding (VLU) tasks, fusion-encoder vision-language models achieve superior results but sacrifice efficiency because they encode images and text jointly. In contrast, dual-encoder models that encode images and text separately are far more efficient, yet fall short on VLU tasks because they lack deep cross-modal interactions. To get the best of both worlds, we propose DiDE, a framework that distills the knowledge of a fusion-encoder teacher model into a dual-encoder student model. Since cross-modal interaction is the key to the teacher's superior performance but is absent in the student, we encourage the student not only to mimic the teacher's predictions but also to compute cross-modal attention distributions and align them with the teacher's. Experimental results demonstrate that DiDE remains competitive with the fusion-encoder teacher in performance (only a 1% drop) while enjoying 4 times faster inference. Further analyses reveal that the proposed cross-modal attention distillation is crucial to the success of our framework.
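A minimal sketch of a DiDE-style training objective, assuming a PyTorch setting: the student matches the teacher's softened task predictions, and additionally computes cross-modal attention distributions from its separately encoded text and image representations and aligns them with the teacher's fusion-layer attention. All names, the equal loss weighting, and the single-layer attention alignment are illustrative assumptions, not the authors' released code.

```python
import math
import torch
import torch.nn.functional as F

def cross_modal_attention(text_states, image_states):
    """Text-to-image attention from separately encoded representations:
    a softmax over image positions for each text token. (Illustrative;
    the paper derives such distributions inside transformer layers.)"""
    d = text_states.size(-1)
    scores = text_states @ image_states.transpose(-1, -2) / math.sqrt(d)
    return scores.softmax(dim=-1)  # [batch, text_len, image_len]

def dide_loss(student_logits, teacher_logits,
              student_attn, teacher_attn, temperature=1.0):
    """Prediction distillation (soft labels) plus cross-modal attention
    distillation, both as KL divergences; equal weighting is assumed."""
    t = temperature
    pred_loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    attn_loss = F.kl_div(
        (student_attn + 1e-12).log(),  # kl_div expects log-probabilities here
        teacher_attn,
        reduction="batchmean",
    )
    return pred_loss + attn_loss

# Toy usage with random tensors standing in for model outputs.
B, T, I, H, C = 2, 8, 16, 64, 3  # batch, text len, image patches, dim, classes
s_attn = cross_modal_attention(torch.randn(B, T, H), torch.randn(B, I, H))
t_attn = torch.randn(B, T, I).softmax(dim=-1)
loss = dide_loss(torch.randn(B, C), torch.randn(B, C), s_attn, t_attn)
```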
- Anthology ID:
- 2022.emnlp-main.608
- Volume:
- Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Editors:
- Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 8901–8913
- URL:
- https://aclanthology.org/2022.emnlp-main.608/
- DOI:
- 10.18653/v1/2022.emnlp-main.608
- Cite (ACL):
- Zekun Wang, Wenhui Wang, Haichao Zhu, Ming Liu, Bing Qin, and Furu Wei. 2022. Distilled Dual-Encoder Model for Vision-Language Understanding. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8901–8913, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- Distilled Dual-Encoder Model for Vision-Language Understanding (Wang et al., EMNLP 2022)
- PDF:
- https://aclanthology.org/2022.emnlp-main.608.pdf