1. Download original data with `download_arxiv.sh`
2. Flatten the JSON files into plain text with `preprocess_arxiv.py` (a rough sketch of this step follows the list)
3. Change into the main directory and run `run_all.sh`
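For orientation, here is a minimal sketch of what the flattening step might look like, assuming line-delimited JSON input with `title` and `abstract` fields; these field names and the input format are assumptions, and the actual logic lives in `preprocess_arxiv.py`:

```python
# Hypothetical sketch of the JSON-to-text flattening step; not the real
# preprocess_arxiv.py. Field names ("title", "abstract") and the
# line-delimited JSON input format are assumptions.
import json
import sys

def flatten(in_path: str, out_path: str) -> None:
    """Read one JSON record per line and write one plain-text line each."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            # Concatenate the assumed text fields into a single flat line.
            text = " ".join(
                record.get(field, "").strip()
                for field in ("title", "abstract")
            )
            dst.write(text + "\n")

if __name__ == "__main__":
    flatten(sys.argv[1], sys.argv[2])
```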

Starting from `raw`, the other directories are populated by the `../run_all.sh` script.
The `raw` folder here currently contains only the first 100 rows of each file, since conference upload limits do not allow including the full (2.2 GB) dataset. The full dataset has already been released and can be downloaded and preprocessed using our scripts.
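If you fetch the full dataset yourself, a truncated sample like the bundled `raw` folder can be reproduced by keeping the first 100 rows of each file. A minimal sketch, where the directory names (`raw_full` as source) are placeholders rather than the repo's actual layout:

```python
# Sketch for reproducing the bundled sample: keep only the first
# n_rows lines of each raw file. Directory names are placeholders.
from itertools import islice
from pathlib import Path

def truncate(src_dir: str, dst_dir: str, n_rows: int = 100) -> None:
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*"):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8") as src:
            head = list(islice(src, n_rows))  # at most n_rows lines
        (out / path.name).write_text("".join(head), encoding="utf-8")

truncate("raw_full", "raw")
```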
