@inproceedings{oneata-etal-2025-seeing,
title = "Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era",
author = "Oneata, Dan and
Elliott, Desmond and
Frank, Stella",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1240/",
pages = "24174--24191",
ISBN = "979-8-89176-256-5",
abstract = "Human learning and conceptual representation is grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as ``encyclopedic'' or ``function''. These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities."
}