# Style Vectors for Steering Generative Large Language Models

## Description

This research explores strategies for steering the output of large language models (LLMs) towards a specific style, e.g., a sentiment, an emotion, or a writing style, by manipulating the activations of hidden layers during text generation. A series of experiments shows that it is possible in principle to find vectors associated with certain style classes that, when added to the hidden activations of an LLM during a forward pass, lead the model to produce answers in the desired style. The results constitute a step towards adaptive and affective AI-empowered interactive systems. Possible negative aspects of manipulating the style of LLMs are discussed as well.

## Installation

All required packages are listed in ```requirements.txt```. We recommend setting up an Anaconda environment with these packages:
```bash
conda create -n emex python=3.8.8
conda activate emex
cd /path/to/emex-emotion-explanation-in-ai
conda install pip # make sure pip is installed
pip install -r requirements.txt
pip install transformers@https://github.com/huggingface/transformers/archive/refs/heads/main.zip
pip install -e . # install the emex package itself - see setup.py
```

## Datasets

We use three different datasets:
1) Yelp Review Dataset: https://github.com/shentianxiao/language-style-transfer
2) Shakespeare Dataset: https://github.com/harsh19/Shakespearizing-Modern-English.git  
3) GoEmotions Dataset: https://huggingface.co/datasets/go_emotions

They are processed and loaded using [dataset_loader.py](utils/dataset_loader.py).

Yelp: We removed duplicates from the dataset because we wanted steering vectors for as many different target sentences as possible.
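The deduplication step could look like the following minimal sketch. It assumes that "duplicate" means an exact match after stripping surrounding whitespace; the helper name `deduplicate` is hypothetical and not taken from `dataset_loader.py`.

```python
def deduplicate(sentences):
    """Remove exact duplicates while keeping the first occurrence and the order.

    Assumption: two sentences are duplicates if they are identical after
    stripping leading/trailing whitespace. The repository may use another rule.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.strip()
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


if __name__ == "__main__":
    reviews = ["great food", "slow service", "great food ", "friendly staff"]
    print(deduplicate(reviews))
```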

GoEmotions: To base the analyses on a stronger theoretical foundation, we used only the 5k samples that could be unambiguously mapped to the six basic emotion categories proposed by Ekman. For this, we loaded all samples using [dataset_loader.py](utils/dataset_loader.py) and saved the filtered result in separate .pkl files as described in [data_prep.py](scripts/analysis/data_prep.py).
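The filtering idea can be illustrated with a small sketch. This is not the code from `data_prep.py`: it assumes each sample is a dict with a `text` string and a `labels` list of emotion names, and it treats a sample as unambiguous when it carries exactly one label that is one of Ekman's six basic emotions. The helper name `filter_unambiguous_ekman` and the sample format are hypothetical.

```python
EKMAN_EMOTIONS = {"anger", "disgust", "fear", "joy", "sadness", "surprise"}


def filter_unambiguous_ekman(samples):
    """Keep samples whose single label is one of the six Ekman emotions.

    Assumption: "unambiguous" means exactly one label per sample.
    The criterion actually used in data_prep.py may differ.
    """
    kept = []
    for sample in samples:
        labels = sample["labels"]
        if len(labels) == 1 and labels[0] in EKMAN_EMOTIONS:
            kept.append({"text": sample["text"], "emotion": labels[0]})
    return kept


if __name__ == "__main__":
    toy_samples = [
        {"text": "I love this!", "labels": ["joy"]},
        {"text": "Well, that was odd.", "labels": ["surprise", "confusion"]},
        {"text": "Thanks a lot.", "labels": ["gratitude"]},
    ]
    print(filter_unambiguous_ekman(toy_samples))
```

In this toy run only the first sample survives: the second has two labels, and the third has a label outside the six Ekman categories.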


## Training Steering Vectors

A steering vector can be trained so that the model outputs only the specified target tokens/sentence (based on https://arxiv.org/pdf/2205.05124.pdf). We provide one training script per dataset. For Yelp, this is [llama_multi_steering_yelp.py](scripts/training/llama_multi_steering_yelp.py):

```bash
conda activate emex
cd /path/to/emex-emotion-explanation-in-ai
python scripts/training/llama_multi_steering_yelp.py
```

You can define the layers for which steering vectors are trained by modifying `INSERTION_LAYERS`.

After training, the steering vectors are saved in ```STEERING_VECTOR_PATH```, which is defined in the script.

The optimization procedure is time- and compute-intensive. On our usual setup (NVIDIA Quadro GV100 with 32GB) we were only able to train 169 vectors in 48 hours. 
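The idea behind the optimization in this section can be sketched on a toy model: the model weights stay frozen, and only a vector added to the output of one layer is optimized so that the target tokens become more likely. This is a simplified illustration, not the training script. `ToyLM` and `train_steering_vector` are hypothetical names, a tiny feed-forward network stands in for the LLM, and the optimizer, learning rate, and step count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyLM(nn.Module):
    """Tiny stand-in for an LLM: embedding, a stack of layers, output head."""

    def __init__(self, vocab_size=20, hidden=16, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, steering=None, layer_idx=None):
        h = self.embed(tokens)
        for i, layer in enumerate(self.layers):
            h = torch.tanh(layer(h))
            if steering is not None and i == layer_idx:
                h = h + steering  # same vector added at every position
        return self.head(h)


def train_steering_vector(model, input_ids, target_ids, layer_idx, steps=200, lr=0.1):
    """Optimize a vector added after layer `layer_idx`; the model stays frozen."""
    for p in model.parameters():
        p.requires_grad_(False)
    z = torch.zeros(model.embed.embedding_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(input_ids, steering=z, layer_idx=layer_idx)
        loss = F.cross_entropy(logits, target_ids)  # negative log-likelihood of the target
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return z.detach(), losses


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyLM()
    target = torch.tensor([3, 7, 1, 12])                   # toy "target sentence"
    inputs = torch.cat([torch.tensor([0]), target[:-1]])  # token 0 acts as BOS
    z, losses = train_steering_vector(model, inputs, target, layer_idx=1)
    print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because one vector is optimized per target sentence and per layer, the cost grows with both, which is consistent with the long training times reported above.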

## Extracting Activation Vectors

Extracting and saving the hidden layer activations can be done using [get_hidden_activations.py](scripts/training/get_hidden_activations.py):

```bash
conda activate emex
cd /path/to/emex-emotion-explanation-in-ai
python scripts/training/get_hidden_activations.py
```
The activations will then be stored at `PATH_TO_ACTIVATION_STORAGE`.
Please keep in mind that storing the activations of all layers for all entries in a dataset can take a couple of hours and results in a couple of hundred GB of .pkl files.
For the Yelp dataset, which was our biggest one, this process resulted in a disk usage of ~334 GB.
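One common way to capture layer outputs in PyTorch is a forward hook. The sketch below shows the pattern on a toy model and pickles the result for one sample. It is an illustration under assumptions, not the code in `get_hidden_activations.py`: the function name `collect_layer_outputs`, the toy model, and the file name `sample_0.pkl` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import torch
import torch.nn as nn


def collect_layer_outputs(model, layers, inputs):
    """Run one forward pass and return {layer_index: output tensor}."""
    captured = {}
    handles = []
    for idx, layer in enumerate(layers):
        # idx is bound as a default argument so each hook keeps its own index.
        def hook(module, args, output, idx=idx):
            # Assumption: the layer returns a plain tensor. Layers that
            # return a tuple would need output[0] here instead.
            captured[idx] = output.detach().cpu()
        handles.append(layer.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(inputs)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return captured


if __name__ == "__main__":
    torch.manual_seed(0)
    toy = nn.Sequential(nn.Linear(8, 8), nn.Tanh(), nn.Linear(8, 8))
    activations = collect_layer_outputs(toy, list(toy), torch.randn(1, 5, 8))
    out_file = Path(tempfile.mkdtemp()) / "sample_0.pkl"
    with open(out_file, "wb") as f:
        pickle.dump(activations, f)
    print(out_file, {k: tuple(v.shape) for k, v in activations.items()})
```

Storing one tensor per layer and per sample in this way is what makes the disk footprint grow so quickly for large datasets.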


## Probing Study / Sentiment Classification

To generate the ROC plots from the paper, we provide one script per dataset:
- Yelp: [classification_with_steering_vectors_yelp.py](scripts/training/classification_with_steering_vectors_yelp.py)
- Shakespeare: [classification_with_steering_vectors_shakespeare.py](scripts/training/classification_with_steering_vectors_shakespeare.py)
- GoEmotions: [classification_with_steering_vectors_goemo.py](scripts/training/classification_with_steering_vectors_goemo.py)

Usage:
```bash
conda activate emex
cd /path/to/emex-emotion-explanation-in-ai
python scripts/training/classification_with_steering_vectors_yelp.py
```

In the scripts you have to define the setting you want to evaluate. The keywords here are ```VECTOR_TYPE``` and ```ACTI_COMPARE_VECS```. The three combinations are:
1) ``VECTOR_TYPE == "steering"``: Evaluate the training-based steering vectors
2) ``VECTOR_TYPE == "activations"``: 
   1) ``ACTI_COMPARE_VECS == "fair"``: Evaluate the activation-based steering vectors for which a training-based steering vector exists
   2) ``ACTI_COMPARE_VECS == "all"``: Use all activation-based steering vectors (can take up to an hour to compute)

In the "all" setting, we don't use every activation vector for the Yelp Review dataset but subsample to 10k activation vectors, because loading all of them at once made us run out of memory. For Shakespeare and GoEmotions this isn't necessary because they have fewer vectors.
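Subsampling before loading keeps memory usage bounded. The following sketch assumes uniform random sampling without replacement and a fixed seed for reproducibility. The helper name `subsample` and the file naming are hypothetical; the scripts may select the 10k vectors differently.

```python
import random


def subsample(items, max_items=10_000, seed=0):
    """Return at most `max_items` elements, drawn uniformly without replacement."""
    items = list(items)
    if len(items) <= max_items:
        return items
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return rng.sample(items, max_items)


if __name__ == "__main__":
    # Hypothetical list of stored activation files; only the selected ones
    # would be loaded into memory afterwards.
    all_files = [f"activation_{i}.pkl" for i in range(25_000)]
    selected = subsample(all_files)
    print(len(selected))
```

Sampling file names rather than loaded vectors means the discarded activations never have to be read from disk.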


## Guided Text Generation
Once we have a set of steering vectors or activations, we can add them to the LLM in order to guide the model's output.
For example, when prompting the model to write a review about a restaurant, we can add "positive" steering vectors in order to generate a more positive review.

The scripts for this are [emotion_eval.py](scripts/evaluation/emotion_eval.py) and [shakes_eval.py](scripts/evaluation/shakes_eval.py).

Here you also need to define onto which layers the steering vectors should be added.
These scripts load already computed training/activation-based style vectors and add them onto the output of the specified layers during inference/text generation.

You can change the model, the layers to use, and the coefficients of the steering vectors.
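Adding a vector onto a layer's output during inference can be done with a PyTorch forward hook that returns a modified output. The sketch below shows this on a toy model. It is an assumption about how such steering can be wired up, not the code of the evaluation scripts, and `add_style_vector` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def add_style_vector(layer, vector, coefficient=1.0):
    """Register a hook that adds coefficient * vector to the layer's output.

    Returns the hook handle; call handle.remove() to stop steering.
    """
    def hook(module, args, output):
        # Some decoder layers return a tuple whose first element is the
        # hidden state; plain layers return a tensor. Handle both cases.
        if isinstance(output, tuple):
            return (output[0] + coefficient * vector,) + tuple(output[1:])
        return output + coefficient * vector

    return layer.register_forward_hook(hook)


if __name__ == "__main__":
    torch.manual_seed(0)
    toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
    x = torch.randn(1, 3, 4)

    baseline = toy(x)
    handle = add_style_vector(toy[0], torch.ones(4), coefficient=2.0)
    steered = toy(x)   # output of the first layer is shifted before layer two
    handle.remove()    # later calls are unsteered again

    print((steered - baseline).abs().max())
```

The coefficient scales how strongly the style vector shifts the activations, so it is the natural knob to tune together with the choice of layers.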