## Data

The scripts in this directory create the data required by the other steps.

1. "extract_and_merge.py" - Merges comments and posts from the Pushshift data dumps into a single JSON file for each month's data dump. These monthly files are then combined in the next step.

Usage
```
python extract_and_merge.py --year 2019 --month 5 --data_dir path/to/reddit_dir/ --out_file_suffix '_experiment1' --subreddits_json subreddits_to_keep.json

# "path/to/reddit_dir/" is expected to contain folders named "submissions" and "comments",
# which hold the corresponding Reddit data dumps.
# "out_file_suffix" is a suffix added to the JSON output file name. Use it to keep the files unique
# and to keep track of data for different experiments.
# "subreddits_json" should point to a JSON file containing a list of the subreddits to retain
# in the output JSON file. Ex: ["uiuc", "berkeley", "UTAustin", "gatech", "aggies"].
```
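
The inputs described above can be prepared and checked before running the script. The following is a minimal sketch, not part of the repository: the helper names `write_subreddits_json` and `missing_dump_folders` are hypothetical, and the only assumptions taken from the usage notes are the JSON list format and the "submissions" and "comments" folder names.

```python
import json
from pathlib import Path


def write_subreddits_json(path, subreddits):
    """Write subreddit names as a JSON list (the format "subreddits_json" expects)."""
    Path(path).write_text(json.dumps(list(subreddits)), encoding="utf-8")


def missing_dump_folders(data_dir):
    """Return the expected dump folders that are absent from data_dir."""
    root = Path(data_dir)
    # Folder names come from the usage notes; this helper itself is hypothetical.
    return [name for name in ("submissions", "comments") if not (root / name).is_dir()]


# Example (paths are placeholders):
#   write_subreddits_json("subreddits_to_keep.json", ["uiuc", "berkeley"])
#   missing_dump_folders("path/to/reddit_dir/")
```

An empty list from `missing_dump_folders` means both folders are present.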

2. "create_db.py" - Creates SQLite databases that combine the data from the previous step. It can create one database for each subreddit.

Usage
```
python create_db.py --json_dir "/mnt/f/reddit/out/mix" --db_name "/mnt/f/reddit/out/mix/mix.db" --json_suffix "_mix" --split_db_by_subreddit 1

# If "split_db_by_subreddit" is 1, a separate DB is created for each subreddit in the data. Since "db_name" is
# "/mnt/f/reddit/out/mix/mix.db" here, the resulting DBs will be of the form "/mnt/f/reddit/out/mix/mix_{subreddit}.db".
# To include only certain subreddits, select them in the previous step using the "subreddits_json" parameter.
```
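
The per-subreddit naming described above can be illustrated as follows. This is only a sketch of the naming pattern from the usage notes; `per_subreddit_db_path` is a hypothetical helper, and "create_db.py" may derive the names differently internally.

```python
from pathlib import PurePosixPath


def per_subreddit_db_path(db_name, subreddit):
    """Map ".../mix.db" to ".../mix_{subreddit}.db" (pattern from the usage notes)."""
    path = PurePosixPath(db_name)
    # Insert "_{subreddit}" between the file stem and its extension.
    return str(path.with_name(f"{path.stem}_{subreddit}{path.suffix}"))


for sub in ["uiuc", "berkeley"]:
    print(per_subreddit_db_path("/mnt/f/reddit/out/mix/mix.db", sub))
# /mnt/f/reddit/out/mix/mix_uiuc.db
# /mnt/f/reddit/out/mix/mix_berkeley.db
```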

3. "create_test_data.py" - Creates a test set consisting of an equal number of positive and negative data points from various subreddits.

Usage
```
python create_test_data.py --subs subreddits_list.json --data_dir dir/with/sub/dbs/ --out_dir output/dir/

# "subs" should point to a JSON file containing a list of the subreddits whose DBs are used to form
# the test set. Ex: ["uiuc", "berkeley", "UTAustin", "gatech", "aggies"].
# "data_dir" is expected to contain a SQLite database for each subreddit. Each DB name should be of the form
# "{--prefix}{sub_name}{--suffix}.db"
```
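
The idea of a balanced test set can be sketched as below. This is an illustration only and not the logic of "create_test_data.py": the function `balanced_sample`, the 1/0 labels, and the seeded downsampling are all assumptions, and what counts as a positive or negative data point is defined by the script itself.

```python
import random


def balanced_sample(positives, negatives, seed=0):
    """Return a shuffled list of (item, label) pairs with equally many of each label."""
    n = min(len(positives), len(negatives))  # size of the smaller class
    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample = [(item, 1) for item in rng.sample(positives, n)]
    sample += [(item, 0) for item in rng.sample(negatives, n)]
    rng.shuffle(sample)
    return sample


pairs = balanced_sample(["p1", "p2", "p3"], ["n1", "n2"])
print(len(pairs))  # 4: two positives and two negatives
```

Downsampling the larger class is one simple way to balance the set; it discards data, so other strategies may be preferable when one class is small.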