# **Uncertainty in Language Models: Assessment through Rank-Calibration**

## TL;DR
We provide a principled, practical, and unified assessment framework for uncertainty/confidence measures of language models. Our assessment is compatible with diverse uncertainty ranges and does not require binarization of correctness scores.
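To make the idea concrete, here is a minimal, standard-library sketch of a binned rank-calibration-style error. It is an illustration only, not the code in `metrics`. The function name `rank_calibration_error`, the `num_bins` parameter, and the equal-mass binning are assumptions made for this example. The underlying intuition is that higher uncertainty should go with lower expected correctness, so only the ranks of the two quantities are compared.

```python
def rank_calibration_error(uncertainties, correctness, num_bins=10):
    """Illustrative binned estimate of a rank-calibration-style error.

    Hypothetical helper, not the repository's implementation.
    Returns 0.0 when lower uncertainty always pairs with higher
    mean correctness across bins; larger values mean worse agreement.
    """
    n = len(uncertainties)
    if n == 0 or n != len(correctness):
        raise ValueError("need two non-empty sequences of equal length")
    num_bins = max(1, min(num_bins, n))

    # Sort sample indices by uncertainty and cut them into equal-mass bins.
    order = sorted(range(n), key=lambda i: uncertainties[i])
    bins = [order[b * n // num_bins:(b + 1) * n // num_bins]
            for b in range(num_bins)]

    # Per-bin averages; correctness may be any real-valued score.
    mean_u = [sum(uncertainties[i] for i in b) / len(b) for b in bins]
    mean_a = [sum(correctness[i] for i in b) / len(b) for b in bins]

    error = 0.0
    for u, a, b in zip(mean_u, mean_a, bins):
        # Fraction of bins that are at most as uncertain as this bin.
        frac_u_le = sum(x <= u for x in mean_u) / num_bins
        # Fraction of bins that are at least as correct as this bin.
        frac_a_ge = sum(x >= a for x in mean_a) / num_bins
        # Weight each bin by its share of the samples.
        error += len(b) / n * abs(frac_a_ge - frac_u_le)
    return error


if __name__ == "__main__":
    u = [0.1, 0.4, 0.2, 0.3]
    a = [0.9, 0.1, 0.7, 0.5]  # continuous correctness scores, no thresholding
    print(rank_calibration_error(u, a, num_bins=4))
    # Rescaling the uncertainty values leaves the result unchanged.
    print(rank_calibration_error([100 * x for x in u], a, num_bins=4))
```

Because the sketch depends only on orderings, any increasing rescaling of the uncertainty values gives the same result, and correctness scores are averaged as-is rather than thresholded.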


## Getting Started

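- Create and activate a virtual environment, then install the dependencies:
```
python -m venv rce
# Linux/macOS; on Windows run rce\Scripts\activate instead
source rce/bin/activate
pip install -r requirements.txt
```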

- Before using the OpenAI API, make sure the `OPENAI_API_KEY` entry in `./run/.env` is set to your API key.
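
As a quick sanity check before making API calls, the snippet below is a minimal sketch that confirms `OPENAI_API_KEY` has a non-empty value in `./run/.env`. It assumes the file uses plain `KEY=VALUE` lines. The helpers `read_env_file` and `has_api_key` are hypothetical names for this example, and how the repository itself loads the file is not covered here.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(path="./run/.env", name="OPENAI_API_KEY"):
    """Return True if the file exists and defines a non-empty value for name."""
    env_path = Path(path)
    return env_path.is_file() and bool(read_env_file(env_path).get(name))


if __name__ == "__main__":
    if has_api_key():
        print("OPENAI_API_KEY found in ./run/.env")
    else:
        print("OPENAI_API_KEY is missing or empty in ./run/.env")
```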

## File Structure
- `indicators`: uncertainty measure implementations
- `metrics`: correctness and calibration metrics, e.g., rank-calibration and ECE
- `models`: OpenAI and open-source model implementations
- `run`: user-facing functions to generate responses, calibrate uncertainty/confidence, and compute evaluation stats
- `submission`: scripts and files to reproduce the results reported in the submission
- `tasks`: dataset loading implementations
- `utils`: miscellaneous utility functions

## Reproduce Results
- Download the collected data from https://drive.google.com/drive/folders/1geL5Sc1qVZBCH4Ytd-Uf6lY97MXyMUfc?usp=sharing
- Unzip *calibration_results.zip* and *evaluation_stats.zip* into the *submission* folder
- To plot indication diagrams and uncertainty/correctness distributions for all experiment configurations:
```
cd submission
./bash/make_plots.sh
```
- To plot RCE boxplots and critical difference diagrams for all experiment configurations:
```
cd submission
python make_tables.py
```
In both cases, the plots are saved under the `calibration_results` and `evaluation_stats` folders, in folders whose names indicate the corresponding experiment configuration.

## License
This codebase is released under the [MIT License](LICENSE).

