@inproceedings{ogezi-etal-2024-semantically,
title = "Semantically-Prompted Language Models Improve Visual Descriptions",
author = "Ogezi, Michael and
Hauer, Bradley and
Kondrak, Grzegorz",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/jlcl-multiple-ingestion/2024.findings-naacl.267/",
doi = "10.18653/v1/2024.findings-naacl.267",
pages = "4285--4302",
abstract = "Language-vision models like CLIP have made significant strides in vision tasks, such as zero-shot image classification (ZSIC). However, generating specific and expressive visual descriptions remains challenging; descriptions produced by current methods are often ambiguous and lacking in granularity. To tackle these issues, we propose V-GLOSS: Visual Glosses, a novel method built upon two key ideas. The first is Semantic Prompting, which conditions a language model on structured semantic knowledge. The second is a new contrastive algorithm that elicits fine-grained distinctions between similar concepts. With both ideas, we demonstrate that V-GLOSS improves visual descriptions and achieves strong results in the zero-shot setting on general and fine-grained image-classification datasets, including ImageNet, STL-10, FGVC Aircraft, and Flowers 102. Moreover, these descriptive capabilities contribute to enhancing image-generation performance. Finally, we introduce a quality-tested silver dataset with descriptions generated with V-GLOSS for all ImageNet classes."
}