# Adaptable Adapters

## Getting the code.

1. Clone this repository and update the submodules
```bash
git clone URL
cd adaptable-adapters
git submodule update --init
```

2. Install the dependencies at `requirements_cu111.txt` or `requirements_cu102.txt` (feel free to use a virtualenv)
```bash
pip install --upgrade -r requirements_cu111.txt
pip install -e ./adapter-transformers ./rational_activations
```
_Note:_ Installing `rational_activations` is non-trivial in some cases.

## How to run the code.

The code depends heavily on reporting data to Weights and Biases (www.wandb.ai), which is used to record the performance and the switch configurations.

A single command at `src/run.py` runs all the code. The two classes of tasks are also independently available in `src/run_glue.py` and `src/run_qa.py`.

Default hyper-parameters are hard-coded in the file `src/run.py` but you can easily change them.

All the tasks report results to, or read results from, a project in WandB. To fix that project, export the environment variable `WANDB_PROJECT` before running any of the scripts:
```bash
export WANDB_PROJECT="adaptable-adapters"
```
or set it inline when invoking a script:
```bash
WANDB_PROJECT="adaptable-adapters" python src/run.py ...
```
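
If you launch runs from a Python driver instead of the shell, the same variable can be set on the child process environment. The following is only a sketch: the helper name `env_with_wandb_project` is hypothetical and not part of this repository, and the command simply reuses flags shown elsewhere in this README.

```python
import os


def env_with_wandb_project(project, base_env=None):
    """Return a copy of the environment with WANDB_PROJECT set.

    Hypothetical helper for illustration; it does not exist in src/.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_PROJECT"] = project
    return env


# Environment and command for one run (flags taken from the examples in this README).
env = env_with_wandb_project("adaptable-adapters")
cmd = ["python", "src/run.py", "baseline", "--task_name", "rte", "--seed", "1", "--low_resource", "128"]
print(env["WANDB_PROJECT"], " ".join(cmd))
```

Passing `cmd` and `env=env` to `subprocess.run` would then start the run with the project fixed, without touching the parent shell.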

For all the training modes (baseline, switches, and fixed) you can control the batch size with the flag `--batch_size`, as in
```bash
python src/run.py baseline --task_name rte --seed 1 --low_resource 128 --batch_size 4
```
This sets the Hugging Face variables `per_device_train_batch_size` and `per_device_eval_batch_size`.
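
As a rough illustration of that mapping, the sketch below parses `--batch_size` and copies it into both Hugging Face fields. The function `to_hf_batch_args` and the parser are assumptions made for this example; the actual wiring in `src/run.py` may differ.

```python
import argparse


def to_hf_batch_args(batch_size):
    """Map one --batch_size value onto both Hugging Face batch-size fields.

    Hypothetical helper; the real logic lives in src/run.py.
    """
    return {
        "per_device_train_batch_size": batch_size,
        "per_device_eval_batch_size": batch_size,
    }


parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, required=True)
args = parser.parse_args(["--batch_size", "4"])

hf_kwargs = to_hf_batch_args(args.batch_size)
print(hf_kwargs)
```

The resulting dictionary could be unpacked into `transformers.TrainingArguments`, which accepts both field names.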

### Training the baseline models

The baseline consists of BERT with the default adapter of AdapterHub (currently the Pfeiffer configuration).

To train the baselines, use the following:
```bash
python src/run.py baseline --task_name rte --seed 1 --low_resource 2048
```

To test the impact of adapters on top of BERT, we also provide simpler baselines. The first is a frozen BERT where we only train the classification head:
```bash
python src/run.py baseline --task_name rte --seed 1 --low_resource 2048 --bert_only
```
The second is the same model, but it exercises the adapter code path by adding an adapter and leaving it out at all layers:
```bash
python src/run.py baseline --task_name rte --seed 1 --low_resource 2048 --leave_out_all
```

### Training the switches.

The switches come with extra configurations. You can drop the skip-connections (or residual connections) inside the default adapters, or you can activate a default square regularization.

To train the switches with two inputs, an identity function and the default adapter, use the following:
```bash
python src/run.py switches --task_name rte --seed 1 --low_resource 2048
```
### Fixed configuration

After running the switches, we can train the configuration they discovered with the switches fixed. In principle, training the switches adds a penalty on the final performance the model can reach. Because of the large number of possible switch configurations (2^12), it is not feasible to check every option; training the switches with the Gumbel-Softmax reduces that time and surfaces good candidates.
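
To make the size of that search space concrete, here is a small sketch that enumerates every configuration. It assumes one binary switch per layer across 12 layers (0 for the identity, 1 for the adapter), which is our reading of the 2^12 figure rather than something the code is confirmed to expose.

```python
from itertools import product

NUM_LAYERS = 12  # assumption: one two-way switch per layer

# Each configuration is a tuple of 12 choices: 0 = identity, 1 = adapter.
configs = list(product((0, 1), repeat=NUM_LAYERS))

print(len(configs))  # 4096
print(configs[0])    # all-identity configuration
print(configs[-1])   # all-adapter configuration
```

Training every one of these 4096 candidates for each task and seed is what the switches let us avoid.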

The final candidates are trained using the same parameters as the switches.
```bash
python src/run.py fixed --task_name rte --seed 1 --low_resource 2048
```
This tries to fetch a previous result for the same task and the same configuration, and takes the state of the switches from the best model (last) result.

#### Dropping the skip-connections

If you want to drop the skip-connections in the adapters add the flag `--drop_skip_connections`:
```bash
python src/run.py switches --task_name rte --seed 1 --low_resource 2048 --drop_skip_connections
```

#### Switches with regularization

If you want to activate the regularization use the flag `--with_regularization`:
```bash
python src/run.py switches --task_name rte --seed 1 --low_resource 2048 --with_regularization
```


### Creation of Tables and Plots

To create the tables and plots from the paper, as well as others, use the `tables` subcommand as follows:
```bash
python src/run.py tables --baseline --low_resources 2048
```
The flag `--baseline` prints the tables associated with the baselines, and the flag `--switches` prints the tables associated with the switch experiments.

You can restrict the output to specific tasks or seeds with the flags `--task_name` and `--seed`; both accept several values, e.g.:
```bash
python src/run.py tables --baseline --switches --low_resources 2048 --seed 1 2 3 --task_name rte mrpc
```

## Our Results

The results reported in the paper are the following:

- [low\_resources=128](TABLES_128.md)
- [low\_resources=256](TABLES_256.md)
- [low\_resources=512](TABLES_512.md)
- [low\_resources=1024](TABLES_1024.md)
- [low\_resources=2048](TABLES_2048.md)


# Rational Adapters and Gumbel-Softmax

The main goal of the current project is to use a Gumbel-Softmax function to select between two different adapters in a discrete fashion. In particular, we want to select between a rational and linear adapter across different layers.

First, this repository contains a submodule with the code of `adapter-transformers`, modified for our needs. Since that package is not easy to extend without copying the whole tree, we keep it as a submodule pointing to the branch containing our changes. To initialize that submodule, run

```bash
git submodule update --init
```

## Gumbel-Softmax

The switch is implemented as a new composition block called `Switch`. It adds only a vector holding the logits of the associated probabilities, with one entry per input to the switch.

In the soft mode of the switch the output is the weighted sum of its inputs. In the hard mode the output is the input associated to the highest weight (argmax).
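
The two modes can be sketched in plain Python as follows. This is an illustration of the Gumbel-Softmax idea, not the `Switch` block from the submodule: the function names, the temperature `tau`, and the list-based inputs are all assumptions made for the example.

```python
import math
import random


def gumbel_softmax_weights(logits, tau=1.0, rng=random):
    """Sample relaxed one-hot weights from logits with the Gumbel-Softmax trick."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def switch(inputs, weights, hard=False):
    """Combine candidate inputs: weighted sum (soft) or argmax selection (hard)."""
    if hard:
        best = max(range(len(weights)), key=weights.__getitem__)
        return list(inputs[best])
    dim = len(inputs[0])
    return [sum(w * x[i] for w, x in zip(weights, inputs)) for i in range(dim)]


identity_out = [1.0, 2.0]  # stand-in for the identity branch
adapter_out = [3.0, 4.0]   # stand-in for the adapter branch
weights = gumbel_softmax_weights([0.0, 0.0], tau=1.0, rng=random.Random(0))

print(switch([identity_out, adapter_out], weights))             # soft mode
print(switch([identity_out, adapter_out], weights, hard=True))  # hard mode
```

Lower values of `tau` push the sampled weights closer to one-hot, so the soft output approaches the hard one.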

## Experiments

The following experiments are designed around GLUE. The baseline is bert-base-uncased with two adapters: pfeiffer and rational.

Starting from the baseline we make several modifications. First, we add switches at one layer and at two layers, exploring all the options; this accounts for 12 + 66 tests on each baseline. We refer to these as `switch_at_x` and `switch_at_x_y`. We then place switches in blocks of consecutive layers of depth 2, 3, 4, and 6. For depth 2 these are referred to as `switch2_x_y`, with switches at layers `[x, x+1, y, y+1]`. In total there are 88 experiments.

We collect the following metrics: accuracy on the validation set (using soft and hard mode), training time, and inference time.
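
The 12 + 66 count for the one-layer and two-layer settings can be checked with a short enumeration. The sketch assumes 12 layers indexed from 0; the indexing used in the actual experiment names may differ.

```python
from itertools import combinations

NUM_LAYERS = 12  # assumption: layers indexed 0..11

# One switch at a single layer x.
single = [f"switch_at_{x}" for x in range(NUM_LAYERS)]

# Switches at two distinct layers x < y.
pairs = [f"switch_at_{x}_{y}" for x, y in combinations(range(NUM_LAYERS), 2)]

print(len(single), len(pairs), len(single) + len(pairs))  # 12 66 78
```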

### Results



