# Data Processing
## 1) Get and convert the data
### SNIPS
First, download the data pre-processed by [(Goo et al., 2018)](https://www.aclweb.org/anthology/N18-2118/) from their
[GitHub repository](https://github.com/MiuLab/SlotGated-SLU/tree/master/data/snips).

Then, merge the files in the `train`, `valid`, and `test` subdirectories to get three files: `label`, `seq.in`, and `seq.out`.
```bash
git clone https://github.com/MiuLab/SlotGated-SLU/
cd SlotGated-SLU/data/snips
# Concatenate the three splits of each file into a single file
for f in label seq.in seq.out
do
    cat "train/$f" "valid/$f" "test/$f" > "$f"
done
```

Finally, run the [process_snips.py](process_snips.py) script to convert these files to TSV format.

### TOP
First, download the data from the Facebook Dialog Corpus [website](http://fb.me/semanticparsingdialog).

Then, concatenate `train.tsv`, `eval.tsv` and `test.tsv`.
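
The concatenation can be sketched in a few lines of Python. This is only an illustration: the helper name `concatenate_splits` and the output file name `top_all.tsv` are hypothetical, and the sketch assumes the three files have no header row, so they can simply be appended in order.

```python
from pathlib import Path


def concatenate_splits(data_dir, output_name="top_all.tsv",
                       splits=("train.tsv", "eval.tsv", "test.tsv")):
    """Append the split files, in order, into one file and return its path.

    `output_name` is a hypothetical name chosen for this sketch.
    """
    data_dir = Path(data_dir)
    output_path = data_dir / output_name
    with output_path.open("w", encoding="utf-8") as out:
        for name in splits:
            text = (data_dir / name).read_text(encoding="utf-8")
            # Add a final newline if it is missing, so that the last row of
            # one file is never glued to the first row of the next one.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
    return output_path
```

Calling `concatenate_splits("path/to/top")` would write the merged file next to the three input files; a plain `cat` of the three files should give the same result whenever each file ends with a newline.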

Finally, run the [process_top.py](process_top.py) script to convert the concatenated file to our TSV format.


### DSTC8

Run `download_and_preprocess_dstc8.sh`.

## 2) Remove long utterances

Run [remove_long_utterances.py](remove_long_utterances.py) to remove utterances that are too long in terms of subword
units. Doing so speeds up training significantly while keeping most of the utterances.
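
To illustrate the idea, here is a minimal sketch of a length filter. It is not the logic of `remove_long_utterances.py`: the function names, the `max_subwords` threshold of 50, and the toy tokenizer (which cuts words into 4-character pieces) are all assumptions made for this example. In practice, the tokenizer of the model being trained would be passed in as `tokenize`.

```python
def toy_subword_tokenize(text, piece_len=4):
    """Stand-in tokenizer (an assumption for this sketch): cut each
    whitespace-separated word into pieces of at most `piece_len` characters."""
    pieces = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


def remove_long_utterances(utterances, tokenize=toy_subword_tokenize, max_subwords=50):
    """Keep only the utterances whose subword count is at most `max_subwords`.

    The default threshold is hypothetical, not the value used by the script.
    """
    return [u for u in utterances if len(tokenize(u)) <= max_subwords]


utterances = [
    "play jazz",
    "book a table for two at the nearest italian restaurant",
]
# With a deliberately small threshold, only the short utterance survives.
kept = remove_long_utterances(utterances, max_subwords=5)
```

Counting subword units rather than words matters because the model's input length is measured after tokenization, and rare words can expand into several units.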

## 3) Remove utterances without entities

Run [remove_all_other.py](remove_all_other.py) to remove utterances without any entities.
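
As a rough sketch of what this filter amounts to, assume each utterance is a pair of token and tag lists, and that tokens outside any entity carry the tag `O`. Both the data layout and the function names below are assumptions for illustration, not the interface of `remove_all_other.py`.

```python
def has_entity(tags, outside_tag="O"):
    """Return True if at least one token carries a tag other than `outside_tag`."""
    return any(tag != outside_tag for tag in tags)


def remove_utterances_without_entities(examples, outside_tag="O"):
    """Keep only the (tokens, tags) pairs that contain at least one entity tag."""
    return [(tokens, tags) for tokens, tags in examples if has_entity(tags, outside_tag)]


examples = [
    (["play", "some", "jazz"], ["O", "O", "B-genre"]),
    (["thank", "you"], ["O", "O"]),
]
# The second example has only "O" tags, so it is dropped.
filtered = remove_utterances_without_entities(examples)
```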

## 4) Split into train, dev, test

To create the meta-partitions, run [split_train_dev_test.py](bin/split_train_dev_test.py).

