# Training and Testing Data
## Original Data
For the evaluation, we used:
- [SNIPS](https://github.com/snipsco/nlu-benchmark), licensed under Creative Commons Zero v1.0 Universal;
- [Task Oriented Parsing (TOP), a.k.a. Facebook Dialog Corpus](http://fb.me/semanticparsingdialog), licensed under CC-BY-SA;
- [Schema-Guided Dialogue State Tracking (DSTC 8)](https://github.com/google-research-datasets/dstc8-schema-guided-dialogue), licensed under CC-BY-SA.

## Pre-processed Data
The subdirectories contain the pre-processed data for each of the datasets mentioned above:
- [SNIPS](snips/),
- [TOP](facebook/),
- [DSTC8](dstc8/).

Each subdirectory contains:
- one file, `all.tsv`, containing all the utterances;
- three files, `train.tsv`, `valid.tsv`, and `test.tsv`, representing the split of `all.tsv` into, respectively, the meta-train, meta-dev and meta-test partitions;
- a directory, `alphabets`, describing the vocabulary used for building the models.
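
As a quick sanity check of the layout listed above, the following sketch verifies that a dataset directory holds the four `.tsv` files and the `alphabets` directory. The function name `check_dataset_dir` is hypothetical and not part of this repository; it only tests that the entries exist and does not inspect their contents.

```bash
# Hypothetical helper (not part of the repository): check that a dataset
# directory contains the files and the directory listed above.
check_dataset_dir() {
  dir="$1"
  status=0
  # The full utterance file and its three partitions.
  for f in all.tsv train.tsv valid.tsv test.tsv; do
    if [ ! -f "${dir}/${f}" ]; then
      echo "missing file: ${dir}/${f}" >&2
      status=1
    fi
  done
  # The vocabulary directory.
  if [ ! -d "${dir}/alphabets" ]; then
    echo "missing directory: ${dir}/alphabets" >&2
    status=1
  fi
  return "$status"
}
```

For example, `check_dataset_dir snips` run from this directory returns a non-zero status and reports each missing entry on standard error if the layout is incomplete.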

To rebuild the alphabets, run the following from the repository root:
```bash
# Set PYTHONPATH on the same command line so that it is passed to the python process.
PYTHONPATH=lib/ python bin/build_alphabets.py --data data/snips/all.tsv \
    --bert_model_path bert-base-uncased --alphabets_folder data/snips/alphabets
```
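
The command above only covers SNIPS. A minimal sketch for the other datasets follows, assuming that they live under `data/facebook` and `data/dstc8` and that `bin/build_alphabets.py` accepts the same arguments for them, which this README does not state explicitly. The function name `rebuild_alphabets` and the `DRY_RUN` variable are hypothetical conveniences, not part of the repository.

```bash
# Hypothetical helper (not part of the repository): rebuild the alphabets of
# one dataset. Assumes it is called from the repository root and that every
# dataset takes the same arguments as the SNIPS command.
rebuild_alphabets() {
  dataset="$1"
  # If DRY_RUN is set to a non-empty value, the command is printed instead of
  # being executed; otherwise the expansion is empty and the command runs.
  ${DRY_RUN:+echo} env PYTHONPATH=lib/ python bin/build_alphabets.py \
    --data "data/${dataset}/all.tsv" \
    --bert_model_path bert-base-uncased \
    --alphabets_folder "data/${dataset}/alphabets"
}
```

Running `for d in snips facebook dstc8; do rebuild_alphabets "$d"; done` would then process all three datasets in turn; setting `DRY_RUN=1` beforehand prints each command so it can be inspected first.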

## License
The pre-processed data is licensed under [CC-BY-SA](LICENSE.txt).

