# Mirror-Consistency

## Overview
This project evaluates how different large language models (LLMs) perform on the Mirror-Consistency metric across several datasets. The experiments use four LLMs and can be customized by changing the model or dataset configuration.

## Models
We use four LLMs in our experiments:
1. `gpt3.5-turbo-0613` - To use this model, supply your own API call by replacing the `_gpt35_api` function in `model.py`.
2. `qwen-turbo` - This model also needs an API call; replace the `_qwen_turbo_api` function in `model.py` with your own.
3. `Llama3-8B-Instruct` - To use this model, download the Hugging Face version of the model weights and set the `model_path` parameter in `run.py` to the path of the downloaded weights.
4. `Llama3-70B-Instruct` - As with `Llama3-8B-Instruct`, download the Hugging Face weights and reference their path correctly in `run.py`.

To switch between these models, set `config.model_name` in `run.py`.
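
As a rough illustration of the model settings described above, the sketch below checks a configuration before a run starts. The `Config` dataclass, the `validate_config` helper, and the two constant tuples are hypothetical names introduced for this example; the actual `config` object in `run.py` may be structured differently. Only the `model_name` and `model_path` fields and the four model names come from this README.

```python
from dataclasses import dataclass
from typing import Optional

# Model names as listed in this README.
SUPPORTED_MODELS = (
    "gpt3.5-turbo-0613",
    "qwen-turbo",
    "Llama3-8B-Instruct",
    "Llama3-70B-Instruct",
)
# Models that load local Hugging Face weights and therefore need model_path.
LOCAL_MODELS = ("Llama3-8B-Instruct", "Llama3-70B-Instruct")


@dataclass
class Config:
    """Hypothetical stand-in for the config object used in run.py."""
    model_name: str = "gpt3.5-turbo-0613"
    model_path: Optional[str] = None  # only needed for the Llama3 models


def validate_config(config: Config) -> Config:
    """Raise ValueError if the model settings are inconsistent."""
    if config.model_name not in SUPPORTED_MODELS:
        raise ValueError(f"Unknown model_name: {config.model_name!r}")
    if config.model_name in LOCAL_MODELS and not config.model_path:
        raise ValueError(f"{config.model_name} requires model_path to be set")
    return config
```

A check like this fails fast when a Llama3 model is selected without a weights path, instead of failing later during model loading.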

## Datasets
Our experiments use the following datasets:
- GSM8K
- SVAMP
- Date
- StrategyQA

To use a different dataset, set `config.dataset_name` in `run.py`.

## Running Experiments
For batch Mirror-Consistency experiments:
1. Set the desired model and dataset parameters in `run.py`.
2. Run `run.py` directly to start the experiments.
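
If you want to cover several model and dataset combinations, it can help to enumerate them up front. The sketch below is only an illustration: `experiment_grid` is a hypothetical helper that is not part of this repository, and the dataset strings are assumed to match the names listed above, which may differ from the exact values `config.dataset_name` expects.

```python
from itertools import product

# Names as listed in this README; the exact strings accepted by run.py
# are an assumption.
MODELS = [
    "gpt3.5-turbo-0613",
    "qwen-turbo",
    "Llama3-8B-Instruct",
    "Llama3-70B-Instruct",
]
DATASETS = ["GSM8K", "SVAMP", "Date", "StrategyQA"]


def experiment_grid(models=MODELS, datasets=DATASETS):
    """Return one settings dict per (model, dataset) pair."""
    return [
        {"model_name": model, "dataset_name": dataset}
        for model, dataset in product(models, datasets)
    ]


if __name__ == "__main__":
    # Print each combination; apply the values to config in run.py by hand
    # or adapt run.py to accept them.
    for settings in experiment_grid():
        print(settings)
```

Each dict holds the two values that would be assigned to `config.model_name` and `config.dataset_name` for one run.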

For evaluation, use `complete_evaluation.py`, which provides tools for detailed performance analysis.

## Additional Resources
- `check_pipeline.ipynb`: A Jupyter notebook that serves as a simple example of the generation process using the configured models.
- `check_plot.py`: A Python script that provides a straightforward example of how to plot and evaluate the experimental results.