World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ziqiao Ma; Jiayi Pan; Joyce Chai

doi:10.18653/v1/2023.acl-long.31

Abstract

The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings, and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose World-to-Words (W2W), a novel visually-grounded language model by pre-training on image-text pairs highlighting grounding as an objective. Through extensive experiments and analysis, we demonstrate that W2W is a more coherent and fast grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly.

Anthology ID: 2023.acl-long.31
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 524–544
URL: https://preview.aclanthology.org/nschneid-patch-2/2023.acl-long.31/
DOI: 10.18653/v1/2023.acl-long.31
Award: Outstanding Paper Award
Cite (ACL): Ziqiao Ma, Jiayi Pan, and Joyce Chai. 2023. World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 524–544, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models (Ma et al., ACL 2023)
PDF: https://preview.aclanthology.org/nschneid-patch-2/2023.acl-long.31.pdf
Video: https://preview.aclanthology.org/nschneid-patch-2/2023.acl-long.31.mp4
