@inproceedings{hakimov-schlangen-2023-images,
title = "Images in Language Space: Exploring the Suitability of Large Language Models for Vision {\&} Language Tasks",
author = "Hakimov, Sherzod and
Schlangen, David",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2023.findings-acl.894/",
doi = "10.18653/v1/2023.findings-acl.894",
pages = "14196--14210",
abstract = "Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input {--} but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model`s output by providing a means of tracing the output back through the verbalised image content."
}
Markdown (Informal)
[Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks](https://aclanthology.org/2023.findings-acl.894/) (Hakimov & Schlangen, Findings 2023)