# SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation

This repository contains the implementation of the paper:

> SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation

## Installing dependencies

We recommend installing required dependencies in a new Anaconda environment via these commands:

```commandline
conda create -n synthesizrr python=3.10.9 --yes  
conda activate synthesizrr

pip install -U "ray==2.5.1" "ray[default]" "ray[tune]" "ray[serve]" "dask[complete]" \
  "torch==2.0.1" "pandas==1.*" "numpy==1.*" "tabulate==0.9.0" \
  tiktoken s3fs scikit-learn tqdm bs4 "boto3==1.33.13" "urllib3==1.26.16" orjson \
  transformers sentence_transformers spacy-transformers spacy tokenizers einops \
  datasets safetensors nltk torchvision pyarrow fastparquet "pydantic==1.10.13" \
  "cloudpickle==2.2.1" "gpustat==1.0.0" accelerate sentencepiece tensorboard "aim==3.*" \
  "faiss-cpu==1.7.4" wandb hvplot holoviews matplotlib bokeh plotly-express jupyter \
  evaluate mauve-text \
  && pip install git+https://github.com/huggingface/transformers \
  && pip install git+https://github.com/huggingface/accelerate \
  && pip install git+https://github.com/huggingface/huggingface_hub \
  && pip install git+https://github.com/huggingface/accelerate@956114ac92cfbdfe0874ca73aa37ac815326f040 \
  && pip install git+https://github.com/huggingface/transformers@fc63914399b6f60512c720959f9182b02ae4a45c

python -m spacy download en_core_web_lg
python -c "import nltk; nltk.download('punkt');"
```
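
After installation, it can help to confirm that the pinned packages resolved to the expected versions, since a later `pip install` may silently upgrade them. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and the pins are copied from the command above.

```python
from importlib import metadata

# Pins copied from the pip command above; extend the dict as needed.
PINNED = {
    "ray": "2.5.1",
    "torch": "2.0.1",
    "tabulate": "0.9.0",
    "boto3": "1.33.13",
    "urllib3": "1.26.16",
    "pydantic": "1.10.13",
    "cloudpickle": "2.2.1",
    "gpustat": "1.0.0",
    "faiss-cpu": "1.7.4",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        # Drop local build tags such as "+cu117" before comparing.
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `find_mismatches(PINNED)` inside the `synthesizrr` environment returns an empty dict when every pin matches; otherwise each entry shows the expected and the installed version.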

## Code structure

`synthesizrr/base/` contains utility functions and classes.

`synthesizrr/expts/` contains code to reproduce the experiments.

## Running the code
1. Set up `DATA_DIR`:
   - Download the datasets into a local folder `DATA_DIR`.
   - Inside `synthesizrr/expts/data.py`, set the variable `DATA_DIR` (marked TODO) to the above folder.

2. Set up `CORPUS_DIR`:
   - Download the corpora into a folder `CORPUS_DIR`.
   - We recommend using S3 for this since the corpora are large.
   - Inside `synthesizrr/expts/corpus.py`, set the variable `CORPUS_DIR` (marked TODO) to the above folder.

3. Set up `RESULTS_DIR`:
   - Inside `synthesizrr/expts/common.py`, set the variable `RESULTS_DIR` (marked TODO) to a separate folder. Intermediate datasets and metrics will be saved here.
   - We recommend using S3 for this since the file paths are long.

4. Start a Ray cluster:
   - On the Ray head node, run: `ray start --head`
   - On the Ray worker nodes, run `ray start --address='<head node IP address>:6379'`
   - At the top of `data.py`, `corpus.py` and `main.py`, add the following to connect to the Ray cluster:
```python
import synthesizrr
import ray
from ray.util.dask import ray_dask_get, enable_dask_on_ray, disable_dask_on_ray
from pprint import pprint

# Connect to the running cluster through the Ray Client port (10001).
pprint(ray.init(
    address='ray://<head node IP address>:10001',
    ignore_reinit_error=True,
    _temp_dir='/tmp/ray/',
    # Ship the local synthesizrr package to the workers.
    runtime_env={"py_modules": [
        synthesizrr,
    ]},
))
enable_dask_on_ray()  # Route Dask computations through the Ray scheduler.
pprint(ray.cluster_resources())  # Shows the number of CPUs and GPUs in the cluster.
```
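
Before launching the heavier steps, the dictionary returned by `ray.cluster_resources()` can be checked against the resources you expect to have. The sketch below is illustrative only: `missing_resources` is a hypothetical helper, and the thresholds in the usage line are placeholders, not requirements stated by this repository. It relies on the `"CPU"` and `"GPU"` keys that Ray reports for the cluster.

```python
def missing_resources(resources, min_cpus=1, min_gpus=0):
    """Return a list of shortfalls; an empty list means the cluster is large enough."""
    problems = []
    # Ray omits a key entirely when the cluster has none of that resource.
    cpus = resources.get("CPU", 0)
    gpus = resources.get("GPU", 0)
    if cpus < min_cpus:
        problems.append(f"need {min_cpus} CPUs, cluster reports {cpus:g}")
    if gpus < min_gpus:
        problems.append(f"need {min_gpus} GPUs, cluster reports {gpus:g}")
    return problems
```

For example, `missing_resources(ray.cluster_resources(), min_cpus=32, min_gpus=4)` could be called right after `ray.init(...)`, with the run aborted if the returned list is non-empty.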

5. After setting `DATA_DIR`, `CORPUS_DIR` and `RESULTS_DIR` and starting the Ray cluster, run the following commands, each from the directory that contains `synthesizrr/`:
   - First, run `cd synthesizrr/expts/ && python3 data.py` to create the datasets. (Certain datasets must already be downloaded to the `DATA_DIR` folder.)
   - Next, run `cd synthesizrr/expts/ && python3 corpus.py` to create the corpora. **Warning:** this step needs a lot of compute. Make sure the Ray cluster is set up and that the head node is a large machine with at least a few hundred GB of RAM.
   - Finally, run `cd synthesizrr/expts/ && python3 main.py` to reproduce the experiments.
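
The three scripts above can also be driven from a single Python entry point that stops at the first failure. This is a minimal sketch, not a script shipped with the repository: `build_commands` and `run_all` are hypothetical names, and it assumes `repo_root` is the directory that contains `synthesizrr/`.

```python
import subprocess
import sys
from pathlib import Path

# Scripts in the order given in step 5.
STEPS = ["data.py", "corpus.py", "main.py"]


def build_commands(repo_root, python=sys.executable):
    """Return (working_dir, argv) pairs, one per script, in run order."""
    expts_dir = Path(repo_root) / "synthesizrr" / "expts"
    return [(expts_dir, [python, script]) for script in STEPS]


def run_all(repo_root):
    """Run each script in order; check=True raises on the first non-zero exit."""
    for cwd, argv in build_commands(repo_root):
        print("Running", " ".join(argv), "in", cwd)
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run_all(Path.cwd())` from the repository root runs `data.py`, then `corpus.py`, then `main.py`. Because each step runs with `check=True`, a failure while creating the datasets prevents the expensive corpus step from starting.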

## License
This project is licensed under the Apache-2.0 License.