Requirements
The Python dependencies are listed in requirements.txt and can be installed using the command:
python3 -m pip install -r requirements.txt

Import Datasets
Replicating the experiments requires the datasets to be present in their appropriate folders. Twitter's rules do not allow redistribution of the datasets, so the Twitter-based datasets need to be downloaded externally and placed in their respective folders. Libraries such as tweepy can be used to retrieve the tweet text given a tweet ID.
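As an illustrative sketch (not part of this repository), tweet text can be re-attached to a file of tweet IDs with a small hydration helper. The `fetch_text` argument is a placeholder callable that could, for example, wrap a tweepy lookup; the column names are assumptions for illustration:

```python
import csv

def hydrate(ids_path, out_path, fetch_text):
    """Fill in tweet text for each tweet_id using a user-supplied fetcher.

    fetch_text(tweet_id) should return the tweet text, or None if the
    tweet is no longer available (e.g. deleted or protected accounts).
    """
    with open(ids_path, newline="", encoding="utf-8") as f_in, \
         open(out_path, "w", newline="", encoding="utf-8") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames + ["tweet"])
        writer.writeheader()
        for row in reader:
            text = fetch_text(row["tweet_id"])
            if text is not None:  # skip tweets that can no longer be retrieved
                writer.writerow({**row, "tweet": text})
```

Rows whose tweets can no longer be fetched are dropped, which is why hydrated datasets are often slightly smaller than the originally published ones.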

1. SMM4H Task 1
Preparing the AE Detection dataset for SMM4H Task 1 requires the file SMM4H19_Task1.csv in the /data/datasets/SMM4H_Task1/ folder. The dataset can be downloaded using the details mentioned in the task paper (https://www.aclweb.org/anthology/W19-3203.pdf). Datasets of this kind usually do not contain the tweet text itself, which has to be downloaded via the Twitter API. The expected format of the file is shown below:
tweet_id, tweet, label
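A quick sanity check on the prepared file can catch format mistakes before running the importer. The helper below is only a sketch (not part of the repository); the column names are taken from the format line above:

```python
import csv

EXPECTED = ["tweet_id", "tweet", "label"]  # columns from the format above

def check_task1_file(path):
    """Lightweight check that the Task 1 CSV matches the expected layout.

    Returns the number of data rows, or raises ValueError on a
    header mismatch.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, skipinitialspace=True)
        cols = [c.strip() for c in reader.fieldnames]
        if cols != EXPECTED:
            raise ValueError(f"unexpected columns {cols}, expected {EXPECTED}")
        return sum(1 for _ in reader)
```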

2. SMM4H Task 2
As with the Task 1 dataset, the importer function expects a file SMM4H19_Task2.csv in the /data/datasets/SMM4H_Task2/ folder with the format below:
tweet_id,begin,end,type,extraction,drug,tweet,meddra_code,meddra_term
The dataset can be downloaded using the details mentioned in the same paper (https://www.aclweb.org/anthology/W19-3203.pdf).
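Since Task 2 carries character-level annotations, it is worth verifying that the offsets line up with the extraction text after hydrating the tweets. The helper below is a sketch; zero-based, end-exclusive offsets are an assumption about the file format, so adjust if the file uses a different convention:

```python
def span_matches(row):
    """Check that the (begin, end) offsets select the annotated extraction
    inside the tweet text. Assumes zero-based, end-exclusive offsets."""
    begin, end = int(row["begin"]), int(row["end"])
    return row["tweet"][begin:end] == row["extraction"]
```

Rows that fail this check typically indicate that the tweet text changed since annotation (e.g. edited or re-encoded) and may need to be excluded.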

3. CADEC v2
The importer for CADEC expects a zip file CADEC.zip in the /data/datasets/CADEC/ folder. The dataset is available at https://data.csiro.au/collections/collection/CIcsiro:10948/SQcadec/RP1/RS25/RORELEVANCE/STsearch-by-keyword/RI1/RT1/ (download CADEC.v2.zip).

4. ADE Corpus v2
This dataset is prepared automatically by the code, which loads it from the Hugging Face datasets package.

5. WEB-RADR
This dataset can be downloaded using the link mentioned in its paper (https://link.springer.com/content/pdf/10.1007/s40264-020-00912-9.pdf). The importer for WEB-RADR expects the file WEB_RADR.csv in the /data/datasets/WEB_RADR/ folder with the following format:
tweet_id, tweet, label, extraction

6. SMM4H French
The SMM4H Twitter AE French dataset was introduced at SMM4H 2020 (https://www.aclweb.org/anthology/2020.smm4h-1.4.pdf), and the importer expects the file SMM4H_French.csv in the /data/datasets/SMM4H_French/ folder with the following format:
tweet_id, tweet, label

Once all the datasets are placed in their respective folders, the following command can be executed to load and prepare them all as model input.

python3 prep_data.py


Running Experiments:

Baseline BERT Models:
The BERT baseline models for AE Detection can be trained using the train_baseline.py script and evaluated using the eval_baseline.py script. More details on the parameters that can be changed are given in those scripts.

T5 Single Task Model:
The T5 model for a single task can be trained using the command: python3 t5_train.py. The dataset name and other hyperparameters can be changed in that script.

T5 Multi Task Model:
The T5 model can be trained in a multi-task setting with the command:
python3 t5_multi_task_train.py
There are a couple of options for the multi-task setting, which are described in the script. The T5 model can be trained with the Task Balancing (TB) or Task plus Dataset Balancing (TDB) approach, using either the Proportional Mixing (PM) or Temperature Scaling (TS) strategy.
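The difference between the two mixing strategies can be illustrated with a small weight computation. This is a generic sketch of the PM/TS idea, not the exact code in t5_multi_task_train.py:

```python
def mixing_weights(sizes, strategy="PM", temperature=2.0):
    """Per-dataset sampling probabilities for multi-task training.

    PM (Proportional Mixing): sample each dataset in proportion to its size.
    TS (Temperature Scaling): raise the proportions to the power 1/T, which
    flattens the distribution; larger T gives small datasets more weight.
    """
    total = sum(sizes)
    probs = [s / total for s in sizes]
    if strategy == "TS":
        scaled = [p ** (1.0 / temperature) for p in probs]
        z = sum(scaled)
        probs = [p / z for p in scaled]
    return probs
```

For example, with two datasets of 900 and 100 examples, PM samples them 90/10, while TS with temperature 2 softens this to roughly 75/25, giving the smaller dataset more influence during training.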

T5 Evaluation:
The trained T5 model can be evaluated on the test set with the command:
python3 t5_eval.py
The test set and the trained model path can be changed in the t5_eval.py script.
 

