# Extract and Explore

Code for the paper *If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions*.



## Setup

Install the following requirements:

```txt
accelerate==0.25.0
bitsandbytes==0.41.2.post2
datasets==2.15.0
open-clip-torch==2.23.0
peft==0.6.3.dev0
transformers==4.36.2
trl==0.7.5.dev0
vllm==0.2.7
torch==2.1.2
```
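Once the packages are installed, a quick check that the environment matches the pins above can save debugging time later. The snippet below is only a sketch and is not part of this repository: `find_mismatches` is a hypothetical helper, and the comparison ignores local build tags (such as a `+cu121` suffix on `torch`), which is an assumption about what counts as a match.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above.
PINNED = {
    "accelerate": "0.25.0",
    "bitsandbytes": "0.41.2.post2",
    "datasets": "2.15.0",
    "open-clip-torch": "2.23.0",
    "peft": "0.6.3.dev0",
    "transformers": "4.36.2",
    "trl": "0.7.5.dev0",
    "vllm": "0.2.7",
    "torch": "2.1.2",
}


def find_mismatches(pinned):
    """Return {package: (wanted, installed)} for every pin that is not satisfied.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            # Drop a local build tag such as "+cu121" before comparing (assumption).
            installed = version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

Running the file prints one line per package whose installed version differs from the pin, and prints nothing when everything matches.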


## Instructions

To analyze the representations of a contrastive VLM, first use `train_runner.sh` to fine-tune an LLM and align it with the VLM's preferences.
As a result, the LLM learns to generate descriptions that lie closer to the corresponding images in the VLM embedding space.
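To make "closer in the VLM embedding space" concrete, here is a minimal sketch that ranks candidate descriptions by cosine similarity to an image embedding. It is an illustration under assumptions, not the training code in this repository: the function names are hypothetical, the embeddings are toy vectors rather than real VLM outputs, and cosine similarity is assumed as the closeness measure because it is the usual choice for contrastive VLMs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_descriptions(image_embedding, description_embeddings):
    """Sort descriptions from most to least similar to the image embedding.

    `description_embeddings` maps each description string to its text embedding.
    Returns a list of (description, similarity) pairs.
    """
    scored = [
        (description, cosine_similarity(image_embedding, embedding))
        for description, embedding in description_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-d vectors standing in for VLM image and text embeddings.
    image = [1.0, 0.0]
    candidates = {
        "description A": [1.0, 0.0],
        "description B": [0.0, 1.0],
        "description C": [1.0, 1.0],
    }
    for description, score in rank_descriptions(image, candidates):
        print(f"{score:.3f}  {description}")
```

In this picture, a description that the VLM "prefers" for a concept is one that scores highly against images of that concept, and such scores can serve as the preference signal when aligning the LLM.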

After training, run `inference_runner.sh` to generate, for each concept, 25 descriptions that the VLM prioritizes.

You can then examine these descriptions to understand how the VLM represents each concept. For example, you can use `inspection_runner.sh` to ask ChatGPT whether each description provides additional information about the corresponding concept.

To use [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) or [ALIGN](https://huggingface.co/docs/transformers/en/model_doc/align) checkpoints, pass the model's name on the Hugging Face Hub as the `vlm_name` argument. To use [OpenCLIP](https://github.com/mlfoundations/open_clip) models, set the `vlm_name` argument to `r-open-clip:MODEL:DATASET`, where `MODEL` is one of the models supported by OpenCLIP and `DATASET` is the pre-training dataset (e.g., `r-open-clip:ViT-bigG-14-CLIPA-336:datacomp1b`). To use OpenCLIP with checkpoints hosted on the Hugging Face Hub, use `r-open-clip:hf-hub:HUGGINGFACE_MODEL_NAME`, e.g., `r-open-clip:hf-hub:apple/DFN5B-CLIP-ViT-H-14-384`.
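The three `vlm_name` forms described above can be told apart with a small parser. The sketch below shows one way to read the format. It is not the parsing code used by this repository, and `VLMSpec` and `parse_vlm_name` are hypothetical names. The first example name, `openai/clip-vit-base-patch32`, is a CLIP checkpoint on the Hugging Face Hub chosen purely for illustration.

```python
from typing import NamedTuple, Optional

OPEN_CLIP_PREFIX = "r-open-clip:"
HF_HUB_PREFIX = "hf-hub:"


class VLMSpec(NamedTuple):
    backend: str               # "transformers" or "open_clip"
    model: str                 # Hub model name, OpenCLIP model name, or "hf-hub:NAME"
    pretrained: Optional[str]  # OpenCLIP pre-training dataset tag, if any


def parse_vlm_name(vlm_name: str) -> VLMSpec:
    """Split a `vlm_name` string into backend, model, and dataset parts."""
    if not vlm_name.startswith(OPEN_CLIP_PREFIX):
        # Plain names are treated as Hugging Face Hub CLIP/ALIGN checkpoints.
        return VLMSpec("transformers", vlm_name, None)

    rest = vlm_name[len(OPEN_CLIP_PREFIX):]
    if rest.startswith(HF_HUB_PREFIX):
        # Keep the "hf-hub:" prefix: OpenCLIP itself uses it to locate Hub checkpoints.
        return VLMSpec("open_clip", rest, None)

    model, separator, dataset = rest.partition(":")
    if not separator or not model or not dataset:
        raise ValueError(
            f"expected 'r-open-clip:MODEL:DATASET' or 'r-open-clip:hf-hub:NAME', got {vlm_name!r}"
        )
    return VLMSpec("open_clip", model, dataset)


if __name__ == "__main__":
    for name in (
        "openai/clip-vit-base-patch32",
        "r-open-clip:ViT-bigG-14-CLIPA-336:datacomp1b",
        "r-open-clip:hf-hub:apple/DFN5B-CLIP-ViT-H-14-384",
    ):
        print(parse_vlm_name(name))
```

For the OpenCLIP cases, the `model` and `pretrained` fields line up with the model-name and `pretrained` arguments of `open_clip.create_model_and_transforms`. Whether the repository wires them through in exactly this way is an assumption.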
