MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, Anette Frank


Abstract
Large-scale pretraining is fast becoming the norm in Vision-Language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.
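As an illustration of the adapter idea mentioned in the abstract, the following is a minimal sketch in plain Python. It assumes a residual bottleneck adapter with a zero-initialized up-projection, which is a common adapter design; the abstract does not specify the adapter form MAGMA uses, so the class `BottleneckAdapter`, the stand-in `frozen_layer`, and all dimensions are hypothetical and not taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vector]


class BottleneckAdapter:
    """Hypothetical residual adapter: h + W_up @ relu(W_down @ h)."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random values; would be trained.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                       for _ in range(bottleneck_dim)]
        # Up-projection: zeros, so the adapter starts out as the identity
        # function; would be trained.
        self.w_up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]

    def __call__(self, hidden):
        update = matvec(self.w_up, relu(matvec(self.w_down, hidden)))
        return [h + u for h, u in zip(hidden, update)]


def frozen_layer(hidden):
    """Stand-in for a frozen language model sublayer (assumption: in this
    setup its weights would never be updated during training)."""
    return [2.0 * h for h in hidden]


adapter = BottleneckAdapter(hidden_dim=4, bottleneck_dim=2)
hidden = [1.0, -2.0, 0.5, 3.0]
# The adapter wraps the frozen sublayer's output.
output = adapter(frozen_layer(hidden))
```

With the zero-initialized up-projection, `output` equals the frozen sublayer's output at the start of training, so under this assumed design the pretrained behavior is preserved until the adapter weights are updated.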
Anthology ID:
2022.findings-emnlp.179
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2416–2428
URL:
https://aclanthology.org/2022.findings-emnlp.179
DOI:
10.18653/v1/2022.findings-emnlp.179
Cite (ACL):
Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. 2022. MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2416–2428, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning (Eichenberg et al., Findings 2022)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2022.findings-emnlp.179.pdf