# CoDA Analysis

Please note that this folder contains all the files used for the experiments shown in our work, but the directory structure may have to be changed to replicate the experiments properly.

## Dependencies

All dependencies used in our implementation are also mentioned in our paper.

- matplotlib
- numpy
- lexicalrichness
- pandas
- sklearn
- nltk
- spacy
- pyspellchecker
- convokit
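
As a quick sanity check before running anything, the sketch below reports which of the listed packages cannot be found in the current environment. It is not part of the repository; `REQUIRED_MODULES` and `missing_modules` are hypothetical names introduced here. Note that import names can differ from the names above: `pyspellchecker` is imported as `spellchecker`, and `sklearn` is installed from PyPI as `scikit-learn`.

```python
from importlib.util import find_spec

# Import names for the packages listed above (an assumption of this sketch:
# the repository imports them under their usual top-level names).
# pyspellchecker is imported as `spellchecker`.
REQUIRED_MODULES = [
    "matplotlib",
    "numpy",
    "lexicalrichness",
    "pandas",
    "sklearn",
    "nltk",
    "spacy",
    "spellchecker",
    "convokit",
]


def missing_modules(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

The check only confirms that each module can be located; it does not verify versions, and some packages may need additional resources (for example nltk corpora or a spacy language model) depending on what the scripts load.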

## Organization

Unfortunately, we were not able to document in detail how each script was used to produce the data. For the final version, we plan to describe the steps for using each script and make our repository public.

## Contents

- corpus_analysis: contains code for analysis of CoDA
- src: contains code for analysis of DUTA / Surface Web + plotting figures for all three datasets
- model: contains code for text classification
- usecase: contains code for the use case. The second use case, which uses BERT, is placed with the BERT classifier in the model directory instead

## Datasets

The DUTA-10K Dataset can be found here: http://gvis.unileon.es/dataset/duta-darknet-usage-text-addresses-10k/

The three datasets used for the surface web can be found here:

- Wikitext-2: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
- IMDb movie reviews: https://ai.stanford.edu/~amaas/data/sentiment/
- Reddit dataset: can be generated using the attached code, src/make_reddit_data.py