IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning
Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha
Abstract
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, obtaining strong downstream performance requires carefully curated prompts, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used, where a set of contextual vectors is learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to capture the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a “green” tree frog) into the design of manual prompts can significantly enhance image-text alignment scores. Building on this observation, we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp on two representative tasks in a few-shot learning setup: generalization to novel classes and unseen domain shifts. Through extensive experiments across 10 downstream datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance over state-of-the-art prompt-tuning frameworks. Notably, in a 16-shot setup, IntCoOp improves over CoOp by 7.35% in average performance across 10 diverse datasets.
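As a concrete illustration of the prompt-tuning setup the abstract describes (learnable context vectors combined with class embeddings, here extended with an attribute embedding), the following is a minimal, self-contained PyTorch sketch. All module names, dimensions, the single free attribute vector, and the toy mean-pooling "text encoder" are illustrative assumptions, not the authors' IntCoOp implementation or the real CLIP API.

```python
# Minimal sketch of CoOp-style prompt tuning with an added attribute embedding.
# Everything here (dimensions, PromptLearner, the pooling stand-in for CLIP's
# text encoder) is a hypothetical placeholder, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, num_classes, ctx_len=4, dim=512):
        super().__init__()
        # Learnable context vectors shared across classes (CoOp-style).
        self.ctx = nn.Parameter(0.02 * torch.randn(ctx_len, dim))
        # Frozen class-name embeddings; random placeholders here, in practice
        # they would come from tokenized class names.
        self.register_buffer("cls_emb", torch.randn(num_classes, 1, dim))
        # Learnable attribute embedding (e.g., "green" for a tree frog).
        self.attr = nn.Parameter(0.02 * torch.randn(1, dim))

    def forward(self):
        n = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)    # [n, ctx_len, d]
        attr = self.attr.unsqueeze(0).expand(n, -1, -1)  # [n, 1, d]
        # Prompt = [context | attribute | class] token embeddings.
        return torch.cat([ctx, attr, self.cls_emb], dim=1)

def text_features(prompts):
    # Stand-in for a frozen text encoder: mean-pool the prompt tokens.
    return F.normalize(prompts.mean(dim=1), dim=-1)

# Toy few-shot training step: align image features with the tuned prompts.
num_classes, dim = 10, 512
learner = PromptLearner(num_classes, dim=dim)
optimizer = torch.optim.SGD(learner.parameters(), lr=1e-2)

image_feats = F.normalize(torch.randn(8, dim), dim=-1)  # frozen image encoder output
labels = torch.randint(0, num_classes, (8,))

logits = 100.0 * image_feats @ text_features(learner()).t()  # scaled cosine similarity
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Only the context and attribute vectors receive gradients; the class embeddings and (in the real setting) both CLIP encoders stay frozen, which is what makes this a prompt-tuning rather than fine-tuning recipe.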
- Anthology ID:
- 2024.emnlp-main.1092
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19584–19601
- URL:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.emnlp-main.1092/
- DOI:
- 10.18653/v1/2024.emnlp-main.1092
- Cite (ACL):
- Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, and Dinesh Manocha. 2024. IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19584–19601, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning (Ghosal et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.emnlp-main.1092.pdf