Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

Chengxu Zhuang, Evelina Fedorenko, Jacob Andreas


Abstract
Today’s most accurate language models are trained on orders of magnitude more language data than human language learners receive, but with no supervision from other sensory modalities that play a crucial role in human learning. Can we make LMs’ representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in terms of learning efficiency in small and developmentally plausible data regimes, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding improves perplexity by around 5% on multiple language modeling tasks compared to other models trained on the same amount of text data. This work underscores the potential of incorporating visual grounding into language models, aligning more closely with the multimodal nature of human language acquisition.
Anthology ID:
2024.findings-acl.15
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
231–247
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.findings-acl.15/
DOI:
10.18653/v1/2024.findings-acl.15
Cite (ACL):
Chengxu Zhuang, Evelina Fedorenko, and Jacob Andreas. 2024. Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling. In Findings of the Association for Computational Linguistics: ACL 2024, pages 231–247, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling (Zhuang et al., Findings 2024)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.findings-acl.15.pdf