# Stollen

This repository contains algorithms to check softmax layers for classes that are unargmaxable.
It also contains code to reproduce the results, tables and figures from the paper.


# Installation

## Install Python dependencies
```bash
python3.7 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
pip install -e .
```


## Set environment variables

```bash
export OMP_NUM_THREADS=1
export STOLLEN_NUM_PROCESSES=4

# Adapt below as needed
export FLASK_APP="$PWD/stollen/server"
# Adapt below if you would rather install models elsewhere
mkdir -p models
export TRANSFORMERS_CACHE="$PWD/models"
```

### Details on environment variables
* `export OMP_NUM_THREADS=1` is needed because numpy otherwise hogs all threads, and we would gain nothing from running the search in parallel.
* Set `STOLLEN_NUM_PROCESSES` to run the search on multiple CPUs/threads. Each worker processes a single vocabulary item, and the workers run in parallel. We used `export STOLLEN_NUM_PROCESSES=10` on an AMD 3900X CPU with 64 GB of RAM.
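
As a rough sketch of how these two variables typically interact in a Python program (not the repository's actual code): `OMP_NUM_THREADS` is usually set before the numerical libraries are imported, and the worker count is read from the environment. The helper names `num_processes`, `check_item` and `search_all` are hypothetical, and the use of `multiprocessing.Pool` is an assumption.

```python
import os

# OpenMP-backed libraries usually read this once at start-up, so it is
# typically set before numpy (or similar) is imported in the process.
os.environ.setdefault("OMP_NUM_THREADS", "1")

from multiprocessing import Pool


def num_processes(default=1):
    """Read STOLLEN_NUM_PROCESSES, falling back to `default` if unset."""
    value = int(os.environ.get("STOLLEN_NUM_PROCESSES", default))
    if value < 1:
        raise ValueError("STOLLEN_NUM_PROCESSES must be at least 1")
    return value


def check_item(index):
    """Hypothetical stand-in for the search on one vocabulary item."""
    return index, index % 2 == 0


def search_all(num_items):
    """Spread the per-item checks over the configured number of workers."""
    with Pool(processes=num_processes()) as pool:
        return pool.map(check_item, range(num_items))
```

In a script, `search_all(...)` would be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.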


## Install [Gurobi](https://www.gurobi.com/academia/academic-program-and-licenses/)

The linear programming algorithm depends on Gurobi.
Gurobi requires a license; see the link above.


# Example Usage

## Verify a randomly initialised softmax layer

This script exists as a sanity check for our algorithms.
We assert that we can detect which points are internal to the convex hull.
To make this assertion we compare results to [Qhull](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html).
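
To illustrate what "internal to the convex hull" means, here is a minimal 2D sketch in pure Python. It uses Andrew's monotone chain algorithm, which is neither Qhull nor the repository's implementation, and the function names are hypothetical. Points that are not hull vertices (strictly inside, or on an edge between two vertices) are the ones that can never be the strict argmax in the bias-free case.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices of 2D points (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a right turn or lie on a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def non_vertex_points(points):
    """Points that are not vertices of the convex hull."""
    hull = set(convex_hull(points))
    return [p for p in points if p not in hull]


# Four corners of a square plus its centre: only the centre is internal.
weights = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
print(non_vertex_points(weights))  # [(1, 1)]
```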


Do 20 class weight vectors randomly initialised in 2 and 3 dimensions have Stolen Probability?

```bash
stollen_random --num-classes 20 --dim 2
stollen_random --num-classes 20 --dim 3
stollen_random --help   # For more details / options
```

If the dimension is 2 or 3 we also plot the resulting convex hull for visualisation purposes.
The result of the algorithm is also compared to the exact Qhull result if `dim < 10`.
The approximate algorithm will have 100% recall but may have lower precision.
```bash
stollen_random --num-classes 300 --dim 8 --seed 3  --patience 50
```
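
The precision/recall comparison can be sketched as follows. This assumes that the "positive" class is the set of classes flagged as unargmaxable, and that the approximate search may give up on a class before finding an input that selects it; both are our reading of the text, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, exact):
    """Precision and recall of `predicted` against the `exact` reference set."""
    predicted, exact = set(predicted), set(exact)
    true_positives = len(predicted & exact)
    # By convention, report 1.0 when the denominator is empty.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(exact) if exact else 1.0
    return precision, recall


# Toy class indices: the approximate result is a superset of the exact one.
exact_unargmaxable = {3, 7}
approx_unargmaxable = {3, 7, 11}

precision, recall = precision_recall(approx_unargmaxable, exact_unargmaxable)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=1.00
```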

Below we run the exact algorithm. It should always return 100% for both precision and recall unless the input range is too large.
```bash
stollen_random --num-classes 300 --dim 8 --seed 3  --patience 50 --exact-algorithm lp_chebyshev
```
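
The name `lp_chebyshev` suggests a Chebyshev-centre linear program. One standard way to pose the question "is class $i$ argmaxable?" in that form is sketched below; this is a reading of the name, and the repository's exact formulation may differ. Here $w_j$ and $b_j$ are the weight vector and bias of class $j$, and $c$ is an assumed bound on each input coordinate.

$$
\begin{aligned}
\max_{x,\, r} \quad & r \\
\text{s.t.} \quad & (w_j - w_i)^\top x + r\,\lVert w_j - w_i \rVert_2 \le b_i - b_j \quad \forall j \ne i, \\
& x_k + r \le c, \quad -x_k + r \le c \quad \forall k .
\end{aligned}
$$

Assuming distinct weight vectors, an optimum $r^* > 0$ means some input $x$ in the box gives class $i$ a strictly larger logit than every other class, so class $i$ is argmaxable; $r^* \le 0$ means no such input exists within the box.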


## Verify that Stolen Probability can be avoided by using weight normalisation

As a sanity check, we verify that all classes are argmaxable when we normalise the weights or set the bias term as described in Appendix D of the paper.

```bash
stollen_prevention --num-classes 500 --dim 10
stollen_prevention --num-classes 500 --dim 10 --use-bias
```
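
The intuition can be checked with a few lines of pure Python. With unit-norm weights and no bias, the input `x = w_i` scores 1 for class `i` and strictly less for any other distinct class. The bias choice `b_i = -||w_i||^2 / 2` shown afterwards has the same property, but it is an assumption on our part that this matches the construction in Appendix D. The helper names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(w):
    """Scale w to unit Euclidean norm."""
    norm = math.sqrt(dot(w, w))
    return [v / norm for v in w]


def argmax_class(weights, x, biases=None):
    """Index of the class with the largest logit w.x + b."""
    if biases is None:
        biases = [0.0] * len(weights)
    scores = [dot(w, x) + b for w, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
raw = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]

# Option 1: unit-norm weights, no bias. Feeding w_i selects class i.
unit = [normalise(w) for w in raw]
ok_normalised = all(argmax_class(unit, unit[i]) == i for i in range(len(unit)))

# Option 2 (assumed form): keep the raw weights, set b_i = -||w_i||^2 / 2.
biases = [-0.5 * dot(w, w) for w in raw]
ok_bias = all(argmax_class(raw, raw[i], biases) == i for i in range(len(raw)))

print(ok_normalised, ok_bias)
```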

The script raises an assertion error if we skip the prevention step:

```bash
stollen_prevention --num-classes 500 --dim 10 --do-not-prevent
stollen_prevention --num-classes 500 --dim 10 --use-bias --do-not-prevent
```

Note that in high dimensions Stolen Probability is not expected to occur if we randomly initialise the weight vectors.


## Verify a model stored in numpy.npz format

The script expects the weight matrix to be stored in the `decoder_Wemb` attribute.
It takes the transpose, since it expects the stored matrix in `[dim, num_classes]` format.

```bash
stollen_numpy --numpy-file path-to-numpy-model.npz
```
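
A minimal sketch of the loading step, assuming the stored matrix has shape `[dim, num_classes]` and is transposed to `[num_classes, dim]` so that each row is one class weight vector. The helper `load_softmax_weights` and the toy file are hypothetical; only `np.load` and `np.savez` are real numpy calls.

```python
import os
import tempfile

import numpy as np


def load_softmax_weights(path, key="decoder_Wemb"):
    """Load the softmax weight matrix from an .npz file and transpose it."""
    with np.load(path) as data:
        if key not in data.files:
            raise KeyError(f"{key!r} not found; available keys: {data.files}")
        # Assumed stored layout: [dim, num_classes]; return [num_classes, dim].
        return data[key].T


# Toy file standing in for a real model checkpoint.
dim, num_classes = 4, 7
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy-model.npz")
    np.savez(toy_path, decoder_Wemb=np.zeros((dim, num_classes)))
    W = load_softmax_weights(toy_path)

print(W.shape)  # (7, 4)
```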


## Verify a model from HuggingFace

```bash
stollen_hugging --url https://huggingface.co/bert-base-cased --patience 2500 --exact-algorithm lp_chebyshev
```

NB: The script does not work with arbitrary models; it needs to be adapted if the softmax weights and bias are stored in an unforeseen variable.


# Reproducing the Paper Results

## Additional installation steps


### Install database
```bash
cd db

export DB_FOLDER="$PWD/stollen_data"
export DB_PORT=5436
export DB_USER=`whoami`
export DB_NAME=stollenprob
export DB_PASSWD="cov1d"
export PGPASSWORD=$DB_PASSWD
export DB_HOST="localhost"
export DB_SSL="prefer"

./install.sh
```
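
For orientation, these variables map naturally onto a standard PostgreSQL connection URI. The sketch below shows one way to assemble it in Python; `postgres_uri` is a hypothetical helper, the example user name is made up, and how the repository actually consumes these variables is not shown here.

```python
import os
from urllib.parse import quote


def postgres_uri(env=os.environ):
    """Build a PostgreSQL connection URI from the DB_* variables."""
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWD"], safe="")
    name = quote(env["DB_NAME"], safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    sslmode = env.get("DB_SSL", "prefer")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}?sslmode={sslmode}"


# Values mirroring the exports above (the user name is illustrative).
example_env = {
    "DB_USER": "alice",
    "DB_PASSWD": "cov1d",
    "DB_NAME": "stollenprob",
    "DB_HOST": "localhost",
    "DB_PORT": "5436",
    "DB_SSL": "prefer",
}
print(postgres_uri(example_env))
```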

## Run experiments

Scripts to reproduce the experiments can be found [here](experiments/stollen_search); see the README.md file for details.
The scripts generally write to a PostgreSQL database, but the `save-db` parameter can be toggled within the script to change that.

## Recreate tables and figures

The following scripts generally accept a file with experiment ids to plot/aggregate.
If you run the experiments and save them to the database, you can pass the generated experiment ids to these scripts.

```
paper/
├── appendix
│   ├── braid-slice-regions
│   └── check_quantiles
├── plots
│   ├── plot_bounded.py
│   ├── plot_random_iterations.py
│   ├── plot_row_iterations.py
│   ├── plot.sh
│   ├── stolen_probability.py
│   └── stolen_probability_with_convex.py
└── tables
    ├── plot_iterations.py
    └── print_bounded_table.py
```

We plan to release our own database files after the anonymity period.

# Related Work

* [Demeter et al. (2020)](https://arxiv.org/abs/2005.02433) identified that this problem can arise in classification layers and coined the term Stolen Probability.
* Warren D. Smith comprehensively summarises [the history of the problem](https://rangevoting.org/WilsonOrder.html).

# Trivia
As we get closer to Christmas, [stollen](https://en.wikipedia.org/wiki/Stollen) probability increases.
