# QAConv: Question Answering on Informative Conversations

## Dataset
Unzip `data.zip`. The files below will then be available under the `data` folder.

* Question-Answer files (`trn.json`, `val.json`, `tst.json`)
```
[
  {
    "id": "tst-0",
    "article_segment_id": "newsdial-1068",
    "article_full_id": [
      "newsidal-NPR-170"
    ],
    "QG": false,
    "question": "Which contact number is available for callers on the line said by NEAL CONAN?",
    "answers": [
      "800-989-8255"
    ]
  }
]
```
* Document files (`article_segment.json`, `article_full.json`)
```
{
"newsdial-1068": {
    "prev_ctx": [
      {
        "id": "newsidal-NPR-170-133",
        "speaker": "AUBREY JEWETT",
        "text": "Up till about a week ago, I was among the many who thought, OK, in the end, Romney's going to pull it out, but I'll tell you, He is in a world of trouble right now in Florida. He may hang on, but Gingrich is really surging in the polls."
      }
    ],
    "seg_dialog": [
      {
        "id": "newsidal-NPR-170-134",
        "speaker": "NEAL CONAN, HOST",
        "text": "Lucy Morgan, what do you think?"
      },
      {
        "id": "newsidal-NPR-170-135",
        "speaker": "LUCY MORGAN",
        "text": "I think Romney will pull it out. The newest poll, the better of the polls, Quinnipiac, came out this morning giving Romney a one-point advantage, within the margin of error. But I think the advantage he has is the early vote and the establishment Republicans who are behind him."
      },
      ...
    ],
    "word_count": 204
  },
}
``` 
```
{
"newsidal-NPR-170": [
    {
      "id": "newsidal-NPR-170-0",
      "speaker": "NEAL CONAN, HOST",
      "text": "This is TALK OF THE NATION. I'm Neal Conan in Orlando. Gabby Giffords bows out of Congress, Michele Bachmann vows to return, Newt reborn in South Carolina, while Santorum struggles to stay afloat. It's Wednesday and time for a..."
    },
    {
      "id": "newsidal-NPR-170-1",
      "speaker": "RICK SANTORUM",
      "text": "These are not cogent thoughts..."
    },
    {
    ...
  ]
}
```
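
The snippet below is a minimal sketch of how the question-answer files and `article_segment.json` could be joined, assuming the structures shown above. The helper names (`load_json`, `segment_to_text`, `iter_examples`) and the `SPEAKER: text` formatting are illustrative choices, not part of the released code. The in-memory sample is copied from the excerpts above.

```python
import json


def load_json(path):
    """Read one UTF-8 encoded JSON file (e.g. data/tst.json)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def segment_to_text(segment):
    """Flatten the `seg_dialog` turns of a segment into 'SPEAKER: text' lines."""
    return "\n".join(
        f"{turn['speaker']}: {turn['text']}" for turn in segment["seg_dialog"]
    )


def iter_examples(qa_items, segments):
    """Yield (question, answers, context), joined on `article_segment_id`."""
    for item in qa_items:
        segment = segments[item["article_segment_id"]]
        yield item["question"], item["answers"], segment_to_text(segment)


# Small in-memory sample mirroring the file excerpts shown above.
qa_items = [
    {
        "id": "tst-0",
        "article_segment_id": "newsdial-1068",
        "question": "Which contact number is available for callers on the line said by NEAL CONAN?",
        "answers": ["800-989-8255"],
    }
]
segments = {
    "newsdial-1068": {
        "prev_ctx": [],
        "seg_dialog": [
            {
                "id": "newsidal-NPR-170-134",
                "speaker": "NEAL CONAN, HOST",
                "text": "Lucy Morgan, what do you think?",
            }
        ],
        "word_count": 204,
    }
}

for question, answers, context in iter_examples(qa_items, segments):
    print(question)
    print(answers)
    print(context)
```

To run this on the real files, replace the sample with `load_json("data/tst.json")` and `load_json("data/article_segment.json")`, assuming `data.zip` was extracted into `data/`.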

## Running Baselines

### Dependency
First, install the requirements with `pip install -r requirements.txt`.

If installing fairscale fails with the error message `AttributeError: type object 'Callable' has no attribute '_abc_registry'`, run `pip uninstall typing` and then redo the installation.

### Retriever
* Run BM25 (./retriever)
```console
❱❱❱ cd retriever
❱❱❱ ./run_retriver.sh tst
```

* DPR-wiki
We release the retrieved top-1 results at `./retriever/output_retriever_rank_dpr-wiki.json`. Please check the [DPR repository](https://github.com/facebookresearch/DPR) for details.
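
For readers unfamiliar with the BM25 step above, the following is a generic sketch of Okapi BM25 scoring over a handful of text segments. It is not the implementation behind the retriever script in `./retriever`. The class name `BM25`, the whitespace tokenizer, the smoothed IDF variant, and the defaults `k1=1.5`, `b=0.75` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, kept deliberately simple)."""
    return text.lower().split()


class BM25:
    """Illustrative Okapi BM25 ranker over a small in-memory collection."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(t) for t in tokens]   # term frequencies per document
        self.doc_len = [len(t) for t in tokens]      # document lengths in tokens
        self.avg_len = sum(self.doc_len) / len(docs)
        n = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.doc_tf for term in tf)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the given query string."""
        tf, length = self.doc_tf[i], self.doc_len[i]
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            norm = tf[term] + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf[term] * (self.k1 + 1) / norm
        return total

    def top1(self, query):
        """Index of the highest-scoring document."""
        return max(range(len(self.doc_tf)), key=lambda i: self.score(query, i))


docs = [
    "the contact number for callers is 800-989-8255",
    "romney leads the polls in florida",
    "gingrich is surging",
]
ranker = BM25(docs)
print(ranker.top1("romney polls"))
```

In a retrieval setup like the one described here, each document would be the flattened text of one conversation segment and the query would be the question.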

### Free-form

* Preprocess (./data)
```console
❱❱❱ python convert_txt.py
```

* Zero-shot (./baseline/free_form/)
```console
❱❱❱ ./run_zs.sh
```

* Training (./baseline/free_form/finetuning/)
```console
❱❱❱ ./run_finetune.sh 0,1 2 allenai/unifiedqa-t5-base 8
```

* Inference (./baseline/free_form/finetuning/)
```console
❱❱❱ ./run_eval.sh 0 ../../../data/nmt/ ../../../data/ output/qaconv-allenai/unifiedqa-t5-base/ unifiedqa-t5-base output/qaconv-allenai/unifiedqa-t5-base/prediction/
❱❱❱ ./run_eval.sh 0 ../../../data/nmt-bm25/ ../../../data/ output/qaconv-allenai/unifiedqa-t5-base/ unifiedqa-t5-base-bm25 output/qaconv-allenai/unifiedqa-t5-base/prediction-bm25/
❱❱❱ ./run_eval.sh 0 ../../../data/nmt-dpr/ ../../../data/ output/qaconv-allenai/unifiedqa-t5-base/ unifiedqa-t5-base-dprwiki output/qaconv-allenai/unifiedqa-t5-base/prediction-dprwiki/
```

### Span-based

* Preprocess (./baseline/span_based)
```console
❱❱❱ cd ./baseline/span_based
❱❱❱ python preproc.py
```

* Training (./baseline/span_based)
```console
❱❱❱ ./run_qa.sh
```

* Inference (./baseline/span_based)
```console
❱❱❱ python test_pipe.py --gpu 0
```

### Evaluation 

* Evaluate a single prediction file (./)
```console
❱❱❱ python evaluate.py data/tst.json prediction/unifiedqa-t5-base-zeroshot.json
```

* Evaluate all prediction files in a folder (./)
```console
❱❱❱ python evaluate.py data/tst.json prediction/ --folder
```

## Ethics
We use only publicly available transcript data and adhere to their guidelines. For example, the Media data is for research purposes only and cannot be used for commercial purposes.
Because conversations may contain biased views, such as specific political opinions from speakers, the transcripts and QA pairs will likely contain them as well. The content of the transcripts and summaries reflects only the views of the speakers, not those of the authors. We remind dataset users that the selected conversations may contain bias, toxicity, and subjective opinions, which may affect model training. Please use the content and data with discretion.
