### This is draft code; I will refine it to consolidate the modules into as few scripts as possible.

### File structure
1. The main entry point for training the model is `train_new.py` in the root directory. We also provide some example
shell scripts for running under different conditions.
2. The code related to data augmentation is under `./data_augmentation`
3. `./attention_experiment` contains scripts for the experiments related to attention in our paper
4. `./model` contains the scripts for all other parts needed to run the experiments, including the models, the
optimizer, the data interface and so on.

### Requirements
1. python==3.7.0
2. torch==1.5.0
3. transformers==3.1.0
4. spacy==2.2.4
5. fairseq==0.9.0 (I downloaded the source code into the root directory)
6. sentencepiece==0.1.94
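
A quick way to confirm that an environment matches the pinned versions above is sketched below. It is only an illustration: the helper names `find_mismatches` and `installed_versions` are hypothetical and not part of this repository, and because fairseq is used here from a source checkout in the root directory, it may be reported as not installed even when it works.

```python
# Pinned versions copied from the Requirements list above.
PINNED = {
    "torch": "1.5.0",
    "transformers": "3.1.0",
    "spacy": "2.2.4",
    "fairseq": "0.9.0",
    "sentencepiece": "0.1.94",
}


def find_mismatches(pinned, installed):
    """Return {package: (wanted, found)} for each package whose version differs.

    `installed` maps package name -> version string, or None when missing.
    """
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }


def installed_versions(names):
    """Look up the installed version of each package, None if not installed."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python >= 3.8
    except ImportError:
        # Python 3.7 needs the `importlib_metadata` backport package.
        from importlib_metadata import PackageNotFoundError, version
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED, installed_versions(PINNED)).items():
        print(f"{name}: wanted {wanted}, found {found}")
```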

### First step
First, obtain all the trained models needed for the experiments.

##### 1. NLI model for persona consistency
Go to `./data_augmentation/prepare_model`, download the DialogueNLI dataset from https://wellecks.github.io/dialogue_nli/ and
put it under this directory. Also download the RoBERTa MNLI model from https://huggingface.co/roberta-large-mnli .

Then train the NLI model on this dataset with the script `train_nli_model.py`. The resulting model is needed in the
following steps.

##### 2. NLI model for dialogue history
Use the same RoBERTa MNLI model and `train_coherence_nli.py` to train it on the InferConvAI2 dataset from
https://github.com/nouhadziri/DialogEntailment .

##### 3. BERT and GPT2 models for diversification

First, use `extract_personas_and_responses.py` to extract the persona and response texts into two json files.

Then use `finetune_bert_and_gpt2.py` to fine-tune the BERT and GPT2 models on `personas.json`, obtaining $BERT_{per}$ and
$GPT2_{per}$, and fine-tune GPT2 on `responses.json` to obtain $GPT2_{res}$.
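
The extraction step can be pictured with the minimal sketch below. It is not the code of `extract_personas_and_responses.py`. It assumes the ParlAI PersonaChat text layout, where each line starts with a turn index, persona lines carry the prefix `your persona: `, and dialogue lines hold tab-separated fields with the response in the second field. The function names and the choice to drop duplicate personas are also assumptions.

```python
import json

PERSONA_PREFIX = "your persona: "  # assumed prefix of persona lines


def extract_personas_and_responses(lines):
    """Split raw PersonaChat-style lines into persona texts and response texts."""
    personas, responses = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Drop the leading turn index ("3 hello ..." -> "hello ...").
        _, _, text = line.partition(" ")
        if text.startswith(PERSONA_PREFIX):
            personas.append(text[len(PERSONA_PREFIX):])
        else:
            # Assumed fields: query, response, reward, candidates.
            fields = text.split("\t")
            if len(fields) >= 2 and fields[1]:
                responses.append(fields[1])
    # Remove duplicate personas while keeping first-seen order.
    return list(dict.fromkeys(personas)), responses


def write_json(path, items):
    """Write a list of strings to `path` as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
```

Calling `write_json("personas.json", personas)` and `write_json("responses.json", responses)` on the two returned lists would produce the two files named in the step above.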

##### 4. Back translation model

Go to the `./BT` directory.

Download the WMT14 en-fr corpus from http://statmt.org/wmt14/translation-task.html#Download , and preprocess it with
BPE using sentencepiece, which produces `sentence.bpe.model`.
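
As a rough illustration of the BPE step, the sketch below trains a sentencepiece BPE model and applies it to a sentence. The corpus file names and the vocabulary size are placeholders chosen for the example, not values taken from this repository, and the helper names are hypothetical. Using the model prefix `sentence.bpe` makes sentencepiece write `sentence.bpe.model` and `sentence.bpe.vocab`.

```python
def bpe_training_args(corpus_files, model_prefix="sentence.bpe", vocab_size=32000):
    """Build the keyword arguments for SentencePieceTrainer.train.

    vocab_size=32000 is an assumed value, not one confirmed by this repository.
    """
    return {
        "input": ",".join(corpus_files),  # comma-separated list of text files
        "model_prefix": model_prefix,     # output: <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": "bpe",
    }


def train_bpe(corpus_files, **overrides):
    """Train the BPE model; sentencepiece is imported lazily."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**bpe_training_args(corpus_files, **overrides))


def encode_line(model_file, text):
    """Split one sentence into BPE pieces, returned as a space-joined string."""
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file=model_file)
    return " ".join(sp.encode(text, out_type=str))


if __name__ == "__main__":
    # Hypothetical file names for the two sides of the WMT14 en-fr corpus.
    train_bpe(["train.en", "train.fr"])
    print(encode_line("sentence.bpe.model", "hello , how are you ?"))
```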

Train the en-fr and fr-en translation models using the shell scripts under this directory, then average the last 5
models using `average_model.sh`.
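
Averaging models means taking the element-wise mean of each parameter over the saved checkpoints. The sketch below shows only that arithmetic on plain Python lists. The contents of `average_model.sh` are not shown here, so treat this as an illustration of the idea, with a hypothetical function name and a simplified checkpoint layout (parameter name mapped to a flat list of floats).

```python
def average_checkpoints(checkpoints):
    """Return the element-wise mean of the parameters in `checkpoints`.

    Each checkpoint is a dict: parameter name -> flat list of floats.
    All checkpoints are assumed to share the same names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


if __name__ == "__main__":
    # Toy stand-ins for the last 5 saved models.
    last_five = [{"w": [float(i), float(i) + 1.0]} for i in range(5)]
    print(average_checkpoints(last_five))
```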

##### 5. Dataset

Obtain the PersonaChat dataset from ParlAI and put it into the `./datasets` directory.

### Data Distillation

Go to `./data_augmentation/data_distillation`.

Use `calculate_entailment.py` to obtain the predictions of the RoBERTa NLI model.

Then use `get_distilled_dataset.py` to build the distilled dataset from the previously obtained NLI logits.
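
One plausible reading of this step is that a sample is kept when the NLI model judges its response to be entailed by at least one persona sentence. The sketch below illustrates that reading only. The label order (entailment at index 2), the 0.9 threshold, and the `nli_logits` field name are all assumptions, and `get_distilled_dataset.py` may apply a different rule.

```python
import math

ENTAILMENT = 2  # assumed label order: contradiction, neutral, entailment


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_entailed(logits_per_persona, threshold=0.9):
    """True if any persona sentence entails the response with high probability."""
    return any(
        softmax(logits)[ENTAILMENT] >= threshold for logits in logits_per_persona
    )


def distill(samples, threshold=0.9):
    """Keep the samples whose response is entailed by some persona sentence.

    Each sample is a dict holding a hypothetical `nli_logits` field: one
    3-element logit list per persona sentence.
    """
    return [s for s in samples if is_entailed(s["nli_logits"], threshold)]
```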

### Data diversification

##### 1. Multi-GPT2 model
First, obtain a Multi-GPT2 model trained on the distilled samples. You can use the shell script
`train_multi_gpt2_distilled.sh` in the root directory.

##### 2. Augment dialogue history
Next, augment the dialogue history. Go to `./BT` and use `get_bt_input_file.py` to transform the distilled data
into the format for back translation. Then use `bpe_split.py` to preprocess the newly obtained txt file with BPE.

Use `evaluate.sh` and `evaluate_back.sh` to translate all utterances into French and then back into English.

Finally, use `recover.py` to restore the txt file to the original distilled data format in a json file.
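
The round trip between the json data and the line-per-utterance txt file can be sketched as follows. This is not the code of `get_bt_input_file.py` or `recover.py`; the `history` field name and both function names are hypothetical. The idea is to remember where each utterance came from so that the back-translated lines can be written back to the same positions.

```python
def flatten_histories(samples):
    """Turn every history utterance into one text line.

    Returns the lines plus an index of (sample position, utterance position)
    pairs, in the same order as the lines.
    """
    lines, index = [], []
    for i, sample in enumerate(samples):
        for j, utterance in enumerate(sample["history"]):
            lines.append(utterance)
            index.append((i, j))
    return lines, index


def recover_histories(samples, translated_lines, index):
    """Put back-translated lines into copies of the original samples."""
    if len(translated_lines) != len(index):
        raise ValueError("translated file has a different number of lines")
    # Copy each sample and its history so the input is left untouched.
    recovered = [dict(s, history=list(s["history"])) for s in samples]
    for (i, j), text in zip(index, translated_lines):
        recovered[i]["history"][j] = text
    return recovered
```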

##### 3. Editing personas
Go to `./data_augmentation/data_diversification`. Use `generate_new_personas_and_edit_responses.py` to obtain
new personas, as well as some samples with edited new responses where applicable.

Use `inference_multi_gpt2.sh` in the root directory to get the predicted responses for the remaining samples.

Use `get_augmented_scores.py` to get the filter scores for each new sample.

Use `filter_augmented_data.py` to get the filtered diversified samples along with the distilled ones. Together they
form the augmented dataset, which is used as an easy curriculum for training.

### Train model

Put the augmented dataset into `./datasets/augmented/`; you can then train the two models using
`train_seq2seq_DD.sh` and `train_gpt2_DD.sh`.
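
Since the augmented dataset is described as an easy curriculum, one way to picture the training order is a two-stage schedule that goes through the augmented samples before the original ones. The sketch below is only an assumption about that ordering. The function name and the epoch counts are hypothetical, and the actual schedule is whatever the two training scripts implement.

```python
def curriculum_schedule(augmented, original, easy_epochs=1, hard_epochs=1):
    """Yield (stage, epoch, sample), with every easy epoch before any hard epoch."""
    # Stage 1: the augmented (easy) data.
    for epoch in range(easy_epochs):
        for sample in augmented:
            yield ("easy", epoch, sample)
    # Stage 2: the original (hard) data.
    for epoch in range(hard_epochs):
        for sample in original:
            yield ("hard", epoch, sample)


if __name__ == "__main__":
    for stage, epoch, sample in curriculum_schedule(["a1", "a2"], ["o1"], 1, 2):
        print(stage, epoch, sample)
```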
