# Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling

## System Requirements

- **Operating System**: Ubuntu 20.04
- **Environment Management Tool**: Anaconda or any other suitable tool
- **Python Version**: 3.11
- **GPU**: At least one A100 or higher recommended

## Create a New Environment

1. Open your terminal.
2. Create and activate a new conda environment:

```sh
conda create -n gac_env python=3.11
conda activate gac_env
```

## Installation

### Install `lm-evaluation-harness`

```sh
cd [root-of-this-repo]/lm-evaluation-harness
pip install -e .
```

### Install GaC Required Packages

```sh
cd [root-of-this-repo]/GaC
pip install -r requirements.txt
```

## Configure `config_api_server.py`

Configure the models to be ensembled in `GaC/config_api_server.py`:

```python
NORM_TYPE_API_SERVER = 'average'  # must be one of 'average'/'ece_norm'/'score'
THRESHOLD_API_SERVER = 1.0

CONFIG_API_SERVER = [
    {
        "weight": "[Please replace the path with the local model weight]",
        "max_memory": {0: "66GiB"},
        "num_gpus": 1.0,
        "name": "Meta-Llama-3-8B-Instruct",
        "score": 65.08,
        "ece": 13.07,
        "priority": "supportive",  # must be one of 'primary'/'supportive'
    },
    {
        "weight": "[Please replace the path with the local model weight]",
        "max_memory": {0: "70GiB", 1: "80GiB"},
        "num_gpus": 2,
        "name": "Meta-Llama-3-70B-Instruct",
        "score": 79.68,
        "ece": 9.49,
        "priority": "supportive",
    },
]
```
> Note: The number of GPUs on your machine must be at least the sum of all `num_gpus` values.
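
The GPU requirement in the note can be checked before starting the server. The following is a minimal sketch, not part of the repository: `total_gpus_requested` and `check_gpu_budget` are hypothetical helper names, and the available GPU count is passed in explicitly so the example runs without a GPU.

```python
def total_gpus_requested(config):
    """Sum the `num_gpus` values across all model entries."""
    return sum(float(entry["num_gpus"]) for entry in config)


def check_gpu_budget(config, available_gpus):
    """Raise ValueError if the models request more GPUs than are available."""
    requested = total_gpus_requested(config)
    if requested > available_gpus:
        raise ValueError(
            f"Config requests {requested} GPUs but only {available_gpus} are available."
        )
    return requested


# Mirrors the num_gpus values of the example config (1.0 + 2).
example_config = [
    {"name": "Meta-Llama-3-8B-Instruct", "num_gpus": 1.0},
    {"name": "Meta-Llama-3-70B-Instruct", "num_gpus": 2},
]

# Hypothetical machine with 4 GPUs; in practice the count could come
# from torch.cuda.device_count().
requested = check_gpu_budget(example_config, available_gpus=4)
print(requested)
```

Fractional values such as `0.5` are summed the same way, so two models sharing one GPU count as a single GPU in this check.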

### Explanation of Parameters

- **CONFIG_API_SERVER**: List of models to be used in the ensemble. Each model configuration includes:
  - **weight**: Local path to the model weights. Download them from Hugging Face and replace the placeholder path.
  - **max_memory**: Controls how much memory the model may use on each GPU. Since each model is managed independently by Ray, the GPU IDs always start from **0**.
  - **num_gpus**: Number of GPUs allocated to this model. Controlled by `ray`. To load two models on one GPU, set `num_gpus` to 0.5 for both models.
  - **priority**: Either `'primary'` or `'supportive'`. If all models are `'supportive'`, every token is ensembled. For thresholded ensembling, set the gate model's `priority` to `'primary'`.
- **NORM_TYPE_API_SERVER**: Ensemble weight normalization type. Must be one of `'average'`, `'ece_norm'`, or `'score'`.
- **THRESHOLD_API_SERVER**: Threshold for ensembling. Has no effect if all models are `'supportive'`.
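
To make the roles of `priority` and the threshold more concrete, here is an illustrative sketch only. It is not the repository's implementation: it assumes all models already share one vocabulary, uses equal-weight averaging, and assumes a gating rule in which the primary model's top probability is compared against the threshold. The function names `average_ensemble` and `next_token` are hypothetical.

```python
def average_ensemble(distributions):
    """Average per-token probability distributions with equal weights."""
    n = len(distributions)
    vocab_size = len(distributions[0])
    return [sum(d[i] for d in distributions) / n for i in range(vocab_size)]


def next_token(primary_probs, supportive_probs, threshold):
    """Pick the next token id.

    Assumed gating rule (an illustration, not confirmed by the repo):
    if the primary model's top probability reaches the threshold, use
    its prediction alone; otherwise average it with the supportive models.
    """
    top = max(primary_probs)
    if top >= threshold:
        return primary_probs.index(top)
    combined = average_ensemble([primary_probs] + supportive_probs)
    return combined.index(max(combined))


# Toy three-token vocabulary with made-up probabilities.
primary = [0.5, 0.3, 0.2]
supportive = [[0.1, 0.7, 0.2]]

confident = next_token(primary, supportive, threshold=0.4)   # primary decides
ensembled = next_token(primary, supportive, threshold=0.6)   # ensemble decides
print(confident, ensembled)
```

Under this assumed rule, a lower threshold lets the primary model decide more tokens on its own, and a higher threshold triggers ensembling more often.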

## Selecting Ensemble Model Combination

All supported models are listed in `GaC/support_models.py`.

## Run Benchmarks

After configuring `GaC/config_api_server.py`, start the server, which loads the models onto the GPUs:

```sh
cd [root-of-this-repo]/GaC
# Please wait until the startup is complete.
uvicorn api_server:app --host 0.0.0.0 --reload
```

Open another terminal after the server has started:
```sh
lm_eval --model TSP --batch_size [batch-size] --tasks [task-name] --num_fewshot [shots] --output_path [result-path] --log_samples
```
> Note: `TSP` was our internal code name during development.

Replace each placeholder in `[]` with an appropriate value. For `[task-name]`, the tasks used in the paper are:

- mmlu_flan_n_shot_generative (5-shot)
- gsm8k (5-shot)
- bbh_fewshot (3-shot)
- triviaqa (5-shot)
- nq_open (5-shot)

### Example for MMLU

```sh
lm_eval --model TSP --batch_size 1 --tasks mmlu_flan_n_shot_generative --num_fewshot 5
```
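
To run all of the listed tasks with their few-shot settings, the commands can be generated from a small table. This is a sketch under assumptions: `PAPER_TASKS` and `build_command` are hypothetical names, only flags already shown in this README are used, and the commands are printed rather than executed.

```python
import shlex

# Task names and few-shot counts as listed in this README.
PAPER_TASKS = {
    "mmlu_flan_n_shot_generative": 5,
    "gsm8k": 5,
    "bbh_fewshot": 3,
    "triviaqa": 5,
    "nq_open": 5,
}


def build_command(task, shots, batch_size=1, output_path=None):
    """Return the lm_eval argument list for one task."""
    cmd = [
        "lm_eval",
        "--model", "TSP",
        "--batch_size", str(batch_size),
        "--tasks", task,
        "--num_fewshot", str(shots),
    ]
    if output_path is not None:
        # Only add output options when a result path is given.
        cmd += ["--output_path", output_path, "--log_samples"]
    return cmd


commands = [build_command(task, shots) for task, shots in PAPER_TASKS.items()]
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run a command: subprocess.run(cmd, check=True)
```

The server must already be running in another terminal before any of the generated commands are executed.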

## License

This project is licensed under the terms of the MIT license.