## Structure of the repository

```
├── preprocess.py       # preprocessing script for all available datasets
├── pretrain.py         # pretraining script
├── finetune_KIE.py     # finetuning script for the KIE task
├── finetune_RE.py      # finetuning script for the RE task
├── GPT_GNN             # main model
│   ├── config.py       # model configurations
│   ├── utils.py        # utilities used by the preprocessing script
│   ├── metric.py       # evaluation metrics used by the finetuning script
│   ├── conv            # graph convolutional layers
│   └── model.py        # model architecture
├── layoutlmv3.py       # script to run layoutlmv3 experiments
├── GeoLayoutLM         # module to run the geolayoutlm experiments
├── requirements.txt
├── README.md
└── .gitignore
```


## Setting up the environment
1. Clone the repo and navigate to its root directory: `cd GPT-GNN`
2. Create a conda environment: `conda create -n myenv python=3.8`
3. Activate your environment: `conda activate myenv`
4. Install torch: `pip install torch==1.13.1`
5. Install the rest of the dependencies: `pip install -r requirements.txt`
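
After completing the steps above, a quick sanity check can confirm that the interpreter and the pinned `torch` version match. The snippet below is a minimal sketch and is not part of the repository; `check_environment` is a hypothetical helper, and it only inspects installed package metadata (it never imports `torch`).

```python
import sys
from importlib import metadata


def check_environment(expected_python=(3, 8), pinned=(("torch", "1.13.1"),)):
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: the defaults mirror the versions used in the setup steps.
    """
    problems = []
    if tuple(sys.version_info[:2]) != tuple(expected_python):
        problems.append(
            "Python %d.%d found, expected %d.%d"
            % (sys.version_info[0], sys.version_info[1], *expected_python)
        )
    for name, wanted in pinned:
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        # Ignore a local build suffix such as "+cu117" when comparing versions.
        if found.split("+")[0] != wanted:
            problems.append(f"{name} {found} found, expected {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```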


## Running the pretraining code
Once your environment is properly set up, run the pretraining code with `python pretrain.py`. You can also pass command-line arguments, which are defined in the code.

Note that the pretraining code runs on the `FUNSD` dataset by default. You can modify the code to use a different dataset. The script also includes some hyperparameters, such as `gpu_num` and `batch_size`. Take a look at the `main` method to see the list of command-line arguments.

The pretrainer saves the model weights at the end of each epoch.


## Running the finetuning code
Once your environment is properly set up and you've finished pretraining, run the finetuning code with `python finetune_KIE.py` (for the KIE task) or `python finetune_RE.py` (for the RE task).

Note that the finetuning code runs on the `FUNSD` dataset by default. You can modify the code to use a different dataset. The script also includes some hyperparameters, such as `gpu_num` and `batch_size`. Take a look at the `main` method to see the list of command-line arguments.

Also note that the finetuning code needs to load your pretrained weights, so make sure to update the `path` variable to point to the pretrained model. 

The finetuning code saves the best-performing model from all epochs.
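
The "keep the best model" behaviour can be pictured with the minimal sketch below. It is an illustration and not the repository's code: `train_with_best_checkpoint` and `save_fn` are hypothetical names, the choice of "higher score is better" is an assumption, and the scores are toy numbers, not results from this project.

```python
def train_with_best_checkpoint(epoch_scores, save_fn):
    """Call save_fn(epoch, score) whenever an epoch beats the best score so far.

    epoch_scores: one validation score per epoch (assumed: higher is better).
    Returns the (epoch, score) pair of the best epoch.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(epoch_scores, start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
            save_fn(epoch, score)  # overwrite the previously saved weights
    return best_epoch, best_score


# Toy scores for illustration only.
saved_epochs = []
best = train_with_best_checkpoint(
    [0.61, 0.74, 0.70, 0.79, 0.78],
    lambda epoch, score: saved_epochs.append(epoch),
)
print(best)          # (4, 0.79)
print(saved_epochs)  # [1, 2, 4]
```

In a real training loop, `save_fn` would write the model weights to disk, so only the checkpoint from the best epoch remains at the end.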


## Datasets

Below is a list of datasets for experimentation. Note that all datasets need to be preprocessed using the `preprocess.py` script before use.

1. `idl`: This is a large collection of enterprise documents. The dataset is meant to be used for pretraining only.
2. `funsd`: This is a small collection of enterprise forms that is annotated for open-ended key information extraction. The four classes to be tagged in the document are: `header`, `question`, `answer`, and `other`.
3. `cord`: This is a collection of receipts with 30 classes to be tagged (e.g. total price).
4. `sroie`: A different collection of receipts with 8 classes.
5. `buddie`: A collection of business entity filings from various US states with 67 classes.
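
For scripting purposes, the list above can be summarised as a small lookup table. This is a sketch only: the `DATASETS` dictionary and `finetuning_datasets` helper are hypothetical and do not exist in the repository; the class counts are taken from the descriptions above.

```python
# Hypothetical registry built from the dataset descriptions in this README.
DATASETS = {
    "idl": {"num_classes": None, "pretraining_only": True},
    "funsd": {
        "num_classes": 4,
        "pretraining_only": False,
        "labels": ["header", "question", "answer", "other"],
    },
    "cord": {"num_classes": 30, "pretraining_only": False},
    "sroie": {"num_classes": 8, "pretraining_only": False},
    "buddie": {"num_classes": 67, "pretraining_only": False},
}


def finetuning_datasets():
    """Return the names of datasets that carry class annotations."""
    return [name for name, info in DATASETS.items() if not info["pretraining_only"]]


print(finetuning_datasets())  # ['funsd', 'cord', 'sroie', 'buddie']
```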


## Configurations
Most model configurations are available in `GPT_GNN/config.py`.

Configurations specific to the preprocessing script are available in `preprocess.py`. For example, you can run:
`python preprocess.py -d funsd -g alignet -n node -w 8`

Configurations specific to the pretraining script are available in `pretrain.py`. For example, you can run:
`python pretrain.py -d idl -g 4 -b 4 -e 1 -m gengnn -o models/idl.bin`
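
To make the flags in the command above easier to read, here is a minimal `argparse` sketch of how such a command line could be parsed. It is not the actual parser in `pretrain.py`: the short flags come from the example command, but the long option names, their meanings (for example `-g` as `gpu_num`), and the default values are assumptions. Check the `main` method of the script for the real definitions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the short flags of the pretraining example."""
    parser = argparse.ArgumentParser(description="Sketch of a pretraining CLI")
    # Long names, help texts, and defaults below are assumptions for illustration.
    parser.add_argument("-d", "--dataset", default="funsd", help="dataset name")
    parser.add_argument("-g", "--gpu_num", type=int, default=0, help="GPU setting")
    parser.add_argument("-b", "--batch_size", type=int, default=4, help="batch size")
    parser.add_argument("-e", "--epochs", type=int, default=1, help="number of epochs")
    parser.add_argument("-m", "--model", default="gengnn", help="model name")
    parser.add_argument("-o", "--output", default="models/model.bin",
                        help="where to write the trained weights")
    return parser


# Parse the example command's arguments from a list instead of sys.argv.
args = build_parser().parse_args(
    "-d idl -g 4 -b 4 -e 1 -m gengnn -o models/idl.bin".split()
)
print(args.dataset, args.gpu_num, args.batch_size, args.epochs, args.model, args.output)
```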

Configurations specific to the finetuning scripts are available in `finetune_KIE.py` and `finetune_RE.py`. For example, you can run:
`python finetune_KIE.py -d funsd -g 4 -b 16 -e 1000 -m gengnn -i models/idl.bin`

`python finetune_RE.py -d funsd -g 4 -b 16 -e 1000 -m gengnn -i models/idl.bin`
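
The example commands in this section can be chained into a full preprocess, pretrain, finetune sequence. The sketch below only builds and prints the command lines; `pipeline_commands` is a hypothetical helper, and reusing the `funsd` preprocessing flags for other datasets is an assumption that may not hold for every dataset.

```python
import shlex


def pipeline_commands(pretrain_dataset="idl", finetune_dataset="funsd",
                      weights="models/idl.bin", finetune_script="finetune_KIE.py"):
    """Build the command lines for one end-to-end run (hypothetical helper)."""
    commands = []
    # Every dataset must be preprocessed before use. dict.fromkeys drops the
    # duplicate when both stages use the same dataset and keeps the order.
    for dataset in dict.fromkeys([pretrain_dataset, finetune_dataset]):
        commands.append(["python", "preprocess.py", "-d", dataset,
                         "-g", "alignet", "-n", "node", "-w", "8"])
    commands.append(["python", "pretrain.py", "-d", pretrain_dataset,
                     "-g", "4", "-b", "4", "-e", "1", "-m", "gengnn", "-o", weights])
    commands.append(["python", finetune_script, "-d", finetune_dataset,
                     "-g", "4", "-b", "16", "-e", "1000", "-m", "gengnn", "-i", weights])
    return commands


for command in pipeline_commands():
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` from the repository root to execute the steps in order. Pass `finetune_script="finetune_RE.py"` to target the RE task instead.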