IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning
Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha
Abstract
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, obtaining strong downstream performance requires carefully curated prompts, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used, where a set of contextual vectors is learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to capture the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a “green” tree frog) into the design of manual prompts can significantly enhance image-text alignment scores. Building on this observation, we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp on two representative tasks in a few-shot learning setup: generalization to novel classes and unseen domain shifts. Through extensive experiments across 10 downstream datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance over state-of-the-art prompt-tuning frameworks. Notably, in a 16-shot setup, IntCoOp improves over CoOp by 7.35% in average performance across 10 diverse datasets.
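As a concrete illustration of the prompt-tuning setup the abstract describes (learnable context vectors combined with class embeddings, here extended with an attribute embedding), the following is a minimal, self-contained PyTorch sketch. All module names, dimensions, the single free attribute vector, and the toy mean-pooling "text encoder" are illustrative assumptions, not the authors' IntCoOp implementation or the real CLIP API.

```python
# Minimal sketch of CoOp-style prompt tuning with an added attribute embedding.
# Everything here (dimensions, PromptLearner, the pooling stand-in for CLIP's
# text encoder) is a hypothetical placeholder, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, num_classes, ctx_len=4, dim=512):
        super().__init__()
        # Learnable context vectors shared across classes (CoOp-style).
        self.ctx = nn.Parameter(0.02 * torch.randn(ctx_len, dim))
        # Frozen class-name embeddings; random placeholders here, in practice
        # they would come from tokenized class names.
        self.register_buffer("cls_emb", torch.randn(num_classes, 1, dim))
        # Learnable attribute embedding (e.g., "green" for a tree frog).
        self.attr = nn.Parameter(0.02 * torch.randn(1, dim))

    def forward(self):
        n = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)    # [n, ctx_len, d]
        attr = self.attr.unsqueeze(0).expand(n, -1, -1)  # [n, 1, d]
        # Prompt = [context | attribute | class] token embeddings.
        return torch.cat([ctx, attr, self.cls_emb], dim=1)

def text_features(prompts):
    # Stand-in for a frozen text encoder: mean-pool the prompt tokens.
    return F.normalize(prompts.mean(dim=1), dim=-1)

# Toy few-shot training step: align image features with the tuned prompts.
num_classes, dim = 10, 512
learner = PromptLearner(num_classes, dim=dim)
optimizer = torch.optim.SGD(learner.parameters(), lr=1e-2)

image_feats = F.normalize(torch.randn(8, dim), dim=-1)  # frozen image encoder output
labels = torch.randint(0, num_classes, (8,))

logits = 100.0 * image_feats @ text_features(learner()).t()  # scaled cosine similarity
loss = F.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Only the context and attribute vectors receive gradients; the class embeddings and (in the real setting) both CLIP encoders stay frozen, which is what makes this a prompt-tuning rather than fine-tuning recipe.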
- Anthology ID:
- 2024.emnlp-main.1092
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19584–19601
- URL:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.emnlp-main.1092/
- DOI:
- 10.18653/v1/2024.emnlp-main.1092
- Cite (ACL):
- Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, and Dinesh Manocha. 2024. IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19584–19601, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning (Ghosal et al., EMNLP 2024)
- PDF:
- https://preview.aclanthology.org/add-emnlp-2024-awards/2024.emnlp-main.1092.pdf