# Annotator diversity

## Installation
1. Install the package in developer (editable) mode with `pip install -e .`. This also installs the dependencies.
2. Install sentence-transformers last and without its dependencies, using `pip install sentence-transformers --no-deps`. Installing it with dependencies would overwrite some of the specific versions installed in step 1.
3. Done ...

## Data
- MFTC: download [here](https://osf.io/k5n7y/).
- MHS: can be downloaded through `preprocess.py`.
- DICES: download the CSV files from GitHub ([1](https://raw.githubusercontent.com/google-research-datasets/dices-dataset/main/350/diverse_safety_adversarial_dialog_350.csv) and [2](https://raw.githubusercontent.com/google-research-datasets/dices-dataset/main/990/diverse_safety_adversarial_dialog_990.csv)) and place them in the data directory under `DICES/`.
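
The DICES download can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `dices_targets` and `download_dices` are hypothetical, and the data directory name `data` is an assumption that should be adjusted to wherever the preprocessing script expects the files.

```python
from pathlib import Path
from urllib.request import urlretrieve

# The two raw CSV files linked in the list above.
DICES_URLS = [
    "https://raw.githubusercontent.com/google-research-datasets/dices-dataset/main/350/diverse_safety_adversarial_dialog_350.csv",
    "https://raw.githubusercontent.com/google-research-datasets/dices-dataset/main/990/diverse_safety_adversarial_dialog_990.csv",
]


def dices_targets(data_dir="data"):
    """Map each DICES URL to its destination path under <data_dir>/DICES/.

    `data_dir="data"` is an assumed default, not a path confirmed by the repo.
    """
    target_dir = Path(data_dir) / "DICES"
    return {url: target_dir / url.rsplit("/", 1)[-1] for url in DICES_URLS}


def download_dices(data_dir="data"):
    """Download both CSV files, skipping any that already exist locally."""
    for url, path in dices_targets(data_dir).items():
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            urlretrieve(url, str(path))
```

Calling `download_dices("data")` would fetch both files into `data/DICES/`; the file names are taken unchanged from the URLs.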

## Preprocessing
Run `preprocess_data.py` after downloading all the data files. This generates the files that the training scripts read in, as well as the data splits.

## Methods
- **Passive Learning** (full, regular training) using `train.py`, where the model learns to predict a probability distribution over labels for each sample (using a soft loss).
- **Active Learning with random selection** using `train_active_learning.py`.
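
To illustrate what training against probability distributions with a soft loss can look like, here is a small sketch in plain Python. It assumes that the target distribution is the normalized count of annotator labels per sample and that the soft loss is cross-entropy against that distribution. Both are common choices but are assumptions here, and the function names are hypothetical rather than taken from `train.py`.

```python
import math
from collections import Counter


def soft_label(annotations, num_classes):
    """Turn one sample's annotations (class indices) into a probability distribution."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits).

    Classes with zero target probability contribute nothing to the sum.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Four annotators labelled one sample: two chose class 0, one each chose 1 and 2.
target = soft_label([0, 0, 1, 2], num_classes=3)
loss = soft_cross_entropy([2.0, 0.5, 0.5], target)
```

Unlike a majority-vote label, the target keeps the minority annotations, so the loss rewards a model for reproducing annotator disagreement rather than only the most frequent label.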

Active learning experiments can be run via the bash scripts in `scripts/active_learning/`. Each script calls `train_active_learning.py` and sets a number of parameters that can be changed. A separate script can be added for each dataset and each label category. The input data must be an unaggregated file (each row corresponds to one annotation) with a text, a data ID (an ID for each unique datapoint), and a label.
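
As a rough illustration of the unaggregated format and of random selection, the sketch below builds a few annotation rows and moves random batches from a pool into a labeled set. The column names `data_id`, `text`, and `label`, the function names, and the choice to select individual annotations rather than whole datapoints are all assumptions for this example; `train_active_learning.py` may differ on each of these points.

```python
import random

# Unaggregated data: one row per annotation, so a data_id can appear several times.
rows = [
    {"data_id": "d1", "text": "first example", "label": 0},
    {"data_id": "d1", "text": "first example", "label": 1},
    {"data_id": "d2", "text": "second example", "label": 1},
    {"data_id": "d2", "text": "second example", "label": 1},
    {"data_id": "d3", "text": "third example", "label": 0},
    {"data_id": "d3", "text": "third example", "label": 2},
]


def random_selection(pool, batch_size, rng):
    """Pick up to batch_size pool positions uniformly at random, without replacement."""
    k = min(batch_size, len(pool))
    return set(rng.sample(range(len(pool)), k))


def active_learning_rounds(rows, batch_size, num_rounds, seed=0):
    """Move random batches from the pool to the labeled set, one batch per round."""
    rng = random.Random(seed)
    pool = list(rows)
    labeled = []
    for _ in range(num_rounds):
        if not pool:
            break
        picked = random_selection(pool, batch_size, rng)
        labeled.extend(pool[i] for i in sorted(picked))
        pool = [row for i, row in enumerate(pool) if i not in picked]
        # A real run would retrain and evaluate the model on `labeled` here.
    return labeled, pool


labeled, pool = active_learning_rounds(rows, batch_size=2, num_rounds=2, seed=0)
```

Random selection serves as the baseline acquisition strategy: every remaining annotation is equally likely to be chosen, regardless of what the model currently predicts.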

## Analysis / Figure creation
See the notebook files under `notebooks/`.

