Fine-tuning vs From Scratch: Do Vision & Language Models Have Similar Capabilities on Out-of-Distribution Visual Question Answering?

Kristian Nørgaard Jensen, Barbara Plank


Abstract
Fine-tuning general-purpose pre-trained models has become a de facto standard, including for Vision and Language tasks such as Visual Question Answering (VQA). In this paper, we take a step back and ask whether a fine-tuned model has linguistic and reasoning capabilities superior to those of a prior state-of-the-art architecture trained from scratch on the training data alone. We perform a fine-grained evaluation on out-of-distribution data, including an analysis of robustness to linguistic variation (rephrasings). Our empirical results confirm the benefit of pre-training for overall performance and for rephrasings in particular. However, our results also uncover surprising limitations, particularly for answering questions involving Boolean operations. To complement the empirical evaluation, this paper also surveys relevant earlier work on 1) available VQA data sets, 2) models developed for VQA, 3) pre-trained Vision+Language models, and 4) earlier fine-grained evaluation of pre-trained Vision+Language models.
Anthology ID:
2022.lrec-1.161
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1496–1508
URL:
https://aclanthology.org/2022.lrec-1.161
Cite (ACL):
Kristian Nørgaard Jensen and Barbara Plank. 2022. Fine-tuning vs From Scratch: Do Vision & Language Models Have Similar Capabilities on Out-of-Distribution Visual Question Answering? In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1496–1508, Marseille, France. European Language Resources Association.
Cite (Informal):
Fine-tuning vs From Scratch: Do Vision & Language Models Have Similar Capabilities on Out-of-Distribution Visual Question Answering? (Jensen & Plank, LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2022.lrec-1.161.pdf
Data
C4, CLEVR, DAQUAR, GQA, MS COCO, TextVQA, Visual Genome, Visual Question Answering, Visual Question Answering v2.0