MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting
Wenyan Li, Dong Li, Wanjing Li, Yuanjie Wang, Hai Jie, Yiran Zhong
Abstract
Pretrained vision-language (VL) models have recently shown impressive results on various multimodal downstream tasks. Many benchmark models build on pretrained causal language models (LMs), leveraging the few-shot learning and generalization capabilities these LMs acquire from large text corpora. However, such models are often gigantic and require large-scale image and text data, at high computational cost, to train. This paper introduces a moderate-size model called MAP for efficient VL transfer learning through adapter-based pretraining and prompting. We aim to answer the question of how much we can accomplish through VL pretraining in the low-data regime while maximizing efficiency in transferring the knowledge of a moderate-size frozen LM. Our experiments demonstrate that MAP achieves substantially better zero-shot and few-shot performance on downstream VL tasks with only 10% of the pretraining data and a 30x lighter pretrained LM backbone compared to Frozen. MAP also outperforms fully trained models of comparable size at retaining its transfer learning ability when the amount of training data is reduced.
- Anthology ID:
- 2023.clasp-1.19
- Volume:
- Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)
- Month:
- September
- Year:
- 2023
- Address:
- Gothenburg, Sweden
- Editors:
- Ellen Breitholtz, Shalom Lappin, Sharid Loaiciga, Nikolai Ilinykh, Simon Dobnik
- Venue:
- CLASP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 185–190
- URL:
- https://aclanthology.org/2023.clasp-1.19
- Cite (ACL):
- Wenyan Li, Dong Li, Wanjing Li, Yuanjie Wang, Hai Jie, and Yiran Zhong. 2023. MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting. In Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD), pages 185–190, Gothenburg, Sweden. Association for Computational Linguistics.
- Cite (Informal):
- MAP: Low-data Regime Multimodal Learning with Adapter-based Pre-training and Prompting (Li et al., CLASP 2023)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/2023.clasp-1.19.pdf
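As background, the adapter-based transfer the abstract refers to typically means inserting a small trainable bottleneck module, with a residual connection, into each layer of a frozen LM, so only the adapter weights are updated during pretraining. The sketch below illustrates that generic mechanism; the dimensions, names, and initialization are illustrative assumptions, not MAP's actual implementation.

```python
# Minimal sketch of a residual bottleneck adapter, the kind of small
# trainable module commonly inserted into each frozen transformer layer.
# All sizes and names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_adapter(d_model=64, d_bottleneck=8):
    """Down-projection and up-projection; only these weights are trained."""
    return {
        "W_down": rng.normal(0.0, 0.02, (d_model, d_bottleneck)),
        "W_up": rng.normal(0.0, 0.02, (d_bottleneck, d_model)),
    }

def adapter_forward(x, adapter):
    # Down-project, apply a ReLU nonlinearity, up-project, then add the
    # residual. With small-scale initialization the module is near-identity,
    # so the frozen LM's behavior is preserved before adapter training.
    h = np.maximum(x @ adapter["W_down"], 0.0)
    return x + h @ adapter["W_up"]

x = rng.normal(size=(4, 64))           # a batch of hidden states from a frozen layer
y = adapter_forward(x, make_adapter())
print(y.shape)                          # (4, 64): same shape, so it slots between layers
```

Because the adapter preserves the hidden-state shape, it can be dropped between any two frozen layers without changing the rest of the network, which is what makes this style of transfer cheap in both parameters and data.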