1.Setup
        conda env create -f IM2.yaml
      or pip install -r requirements.txt


2.Datasets

    1. DSTC10 Human Annotation Data:

    2. Generate the NUF, CR, IES, and Overall datasets:
                python prepross.py

    3.   ./datasets contains the DailyDialog datasets, the AB-BA datasets for training the AB-BA
    sub-metric, and the DSTC9 datasets for the generalization test of our IM2


3.Pretrained Model

     1.DialoGPT-medium: we use DialoGPT, a state-of-the-art large-scale pretrained response generation model, as part
      of the D-PPL sub-metric and the AB-BA sub-metric. We use DialoGPT directly as a language model to get the perplexity (PPL) of a response.
      To use the D-PPL sub-metric or train the AB-BA metric, you should first download the model weights.
       model url: https://huggingface.co/microsoft/DialoGPT-medium

     2.We also trained a roberta-base model on the PersonaChat dataset; it is stored in ./ckpt. You should also place the DialoGPT
     model weights in the ckpt folder. We use the roberta-base model trained on PersonaChat as part of the D-MLM sub-metric and
     the 5-IES/5-NUF sub-metrics to evaluate different qualities of a response.
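The D-PPL idea above (DialoGPT used as a plain language model to score a response by its perplexity) can be sketched as follows. This is a hedged illustration, not the code in ppl-dialogpt/DialoGPT-ppl.py: the function names are illustrative, and the Hugging Face transformers library is assumed to be installed.

```python
import math

def ppl_from_logprobs(logprobs):
    """Perplexity from per-token log-probabilities (natural log)."""
    return math.exp(-sum(logprobs) / len(logprobs))

def response_ppl(response, model, tokenizer):
    """Perplexity of one response under a Hugging Face causal LM such as DialoGPT."""
    import torch  # imported lazily; only needed when scoring with a real model
    ids = tokenizer(response, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean token
        # cross-entropy loss, whose exponential is the perplexity.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

# Usage sketch (downloads the weights on first run):
#   from transformers import AutoModelForCausalLM, AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
#   lm = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium").eval()
#   response_ppl("I am fine, thank you.", lm, tok)
```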


4.Train sub metrics
        1.ab_ac metric
                     cd ./ab_ac
                     python train.py  --train-ctx-path datasets/dailydialog/dailydialog_train_ctx.txt \
                    --train-res-path datasets/dailydialog/dailydialog_train_res.txt \
                    --valid-ctx-path datasets/dailydialog/dailydialog_valid_ctx.txt \
                    --valid-res-path datasets/dailydialog/dailydialog_valid_res.txt \
                    --batch-size 16 \
                    --max-epochs 2 \
                    --ctx-token-len 50 \
                    --res-token-len 25

        2.ab_ba metric
                    cd ./ab_ba
                     python train.py \
                    --batch-size 16 \
                    --max-epochs 2 \
                    --ctx-token-len 50 \
                    --res-token-len 25

         3. GRADE: we use GRADE (Huang, 2020) as the word/topic-level coherence signal in the CR metric. Follow these steps:
               1. cd ./grade/texar-pytorch   2. pip install .   3. cd ./script
               4. bash preprocess_training_dataset.sh
               5. To train GRADE: cd ./script and run bash train.sh
               6. Finally, GRADE can be used directly on your own dialog dataset: bash inference.sh
               (for detailed steps see: https://github.com/li3cmz/GRADE)

          4. 5-NUF / 5-IES
                     cd ./nuf-class  (or cd ./ies-class)
                     python train.py \
                     --train-ctx-path datasets/NUF/dialog.txt \
                     --train-label-path datasets/NUF/label.txt \
                     --batch-size  \
                     --max-epochs
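The training command above takes a dialog file and a parallel label file. A minimal sketch of pairing them into training examples is below; the one-example-per-line layout and integer quality labels are assumptions about the dataset format, not confirmed by the repo.

```python
from pathlib import Path

def load_pairs(dialog_path, label_path):
    """Pair lines of a dialog file with lines of a parallel label file.

    Assumes one example per line and one integer label per line.
    """
    dialogs = Path(dialog_path).read_text(encoding="utf-8").splitlines()
    labels = [int(line) for line in Path(label_path).read_text(encoding="utf-8").split()]
    if len(dialogs) != len(labels):
        raise ValueError("dialog and label files must have the same number of lines")
    return list(zip(dialogs, labels))
```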


5.Train the FI/NUF/CR/IES metrics on the NUF/CR/IES Datasets for the inner linear weights
        FI  = W1 ∗ D-PPL + W2 ∗ LTR + W3 ∗ LR
        NUF = W4 ∗ LSC + W5 ∗ VUP + W6 ∗ 5-NUF
        CR  = W7 ∗ GRADE + W8 ∗ AB-AC + W9 ∗ AB-BA
        IES = W10 ∗ Dist-n + W11 ∗ D-MLM + W12 ∗ 5-IES

        python regression.py --model_score_path   --human_anno_path
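What the weight fitting does can be sketched as a least-squares regression of human annotation scores onto the stacked sub-metric scores. This is the idea only; regression.py's actual options and any regularization it uses are not reproduced here.

```python
import numpy as np

def fit_linear_weights(sub_scores, human):
    """Fit linear weights W minimizing ||sub_scores @ W - human||^2.

    sub_scores: (n_samples, n_metrics) array, one column per sub-metric.
    human:      (n_samples,) array of human annotation scores.
    """
    w, *_ = np.linalg.lstsq(np.asarray(sub_scores, dtype=float),
                            np.asarray(human, dtype=float), rcond=None)
    return w
```

For the FI metric, for example, the columns would be the D-PPL, LTR, and LR scores, and the fitted vector would be (W1, W2, W3).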

6.Train IM2 framework on Overall Datasets to get linear weights

        IM2 = W13 ∗ FI + W14 ∗ NUF + W15 ∗ CR + W16 ∗ IES

        python regression.py --model_score_path   --human_anno_path


7.Predict:
    1. Get the sub-metric scores on your datasets.
            If you want to get the AB-BA / AB-AC score, you should cd ./ab_ac (or cd ./ab_ba) and run the following command:
              python predict.py \
                --weight-dir path_to_directory_of_all_weights \
                --context-file path_to_ctx_file \
                --response-file path_to_res_file
            Or run the following commands to get the Dist-n / D-PPL / LR scores ...
            cd ./dist-n and run python diversity_evaluation.py to get the Dist-n score;
            cd ./ppl-dialogpt and run python DialoGPT-ppl.py to get the FI-PPL score;
            cd ./length_ratio and run python ratio-score.py to get the LR score;
            .....
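As one concrete sub-metric, the Dist-n score named above is the standard distinct-n diversity measure: unique n-grams divided by total n-grams across the responses. The sketch below shows that definition with whitespace tokenization; diversity_evaluation.py's tokenization may differ.

```python
def distinct_n(responses, n=2):
    """Distinct-n: number of unique n-grams / total n-grams over all responses."""
    ngrams = []
    for resp in responses:
        toks = resp.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```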

    2. Combine the sub-metric scores to get the category scores (NUF/CR/IES scores) on your datasets.

    3. Use the linear weights you just trained to combine the category scores and produce the final IM2 score.
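Steps 2 and 3 amount to two levels of weighted sums: each category score is a weighted sum of its sub-metric scores, and IM2 is a weighted sum of the category scores. A minimal sketch, where all weight and metric names in the usage example are placeholders for the trained values:

```python
def weighted_sum(scores, weights):
    """Sum of weights[name] * scores[name] over the named weights."""
    return sum(w * scores[name] for name, w in weights.items())

def im2_score(sub_scores, inner_weights, outer_weights):
    """Two-level combination.

    sub_scores:    sub-metric name -> score
    inner_weights: category -> {sub-metric name: weight}   (W1..W12)
    outer_weights: category -> weight                      (W13..W16)
    """
    category_scores = {cat: weighted_sum(sub_scores, w)
                       for cat, w in inner_weights.items()}
    return weighted_sum(category_scores, outer_weights)
```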


Hint:
            1. Each metric in the IM2 framework can be used alone, or you can use IM2 itself. For a specific quality, we recommend using the
        corresponding metric in the framework for optimal correlation, e.g., use the IES metric to evaluate whether dialog responses are
        interesting or engaging.
            2. For time reasons, this is only a basic code implementation of the paper; if it is accepted, we will polish the code
             and open-source it on GitHub.