# Code for ARR Oct

## Requirements

### Environment:

- Python 3.10.12
- Ubuntu 22.04

### Setup:
```
# Create and activate a Python environment (optional)
conda create -n pyt1.12 python=3.10.12
conda activate pyt1.12

# Install pytorch with cuda (optional)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# Install python dependencies
pip install -r requirements.txt

# Download NLTK data
python -m nltk.downloader punkt
```
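
After running the setup commands, a quick sanity check can confirm that the main dependencies are visible to the interpreter. The following is a minimal sketch, not part of the released code: the helper name `missing_packages` is hypothetical, and the package list is only inferred from the commands above (`requirements.txt` may list more).

```python
import importlib.util
import sys

# Package names assumed from the setup commands above;
# requirements.txt may list additional dependencies.
PACKAGES = ("torch", "nltk")


def missing_packages(names=PACKAGES):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If any package is reported as missing, rerun the corresponding install step inside the activated environment.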

### Data
- The `annotation` folder contains the annotation guidelines and the fine-grained entity ontology.
- The `CHEMET` folder contains the full CHEMET dataset and its few-shot subsets. Each folder contains four files: `train.json`, `valid.json`, `test.json`, and `types.json`.
- The `ChemNER+` folder contains the full ChemNER+ dataset and its few-shot subsets. Each folder contains four files: `train.json`, `valid.json`, `test.json`, and `types.json`.

`train.json`, `valid.json`, and `test.json` are used for training, validation, and testing, respectively. Each file contains multiple lines, and each line represents one instance. The schema for each instance is listed below:
```
{
    "coupling":        #   sentence id
    "sent_tokens":     #   tokens in the sentence
    "entities":        #   ground-truth entities in the sentence; a list containing entity type, text, start position, and end position
    "f1":              #   semantic similarity between the entity list and the input
}
```
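
To inspect the data, the files can be read line by line. The sketch below assumes that each line is a standalone JSON object with the fields listed above, and that each entry in `entities` is a list whose first item is the entity type. The helper names `load_instances` and `count_entity_types` are hypothetical and are not part of the released code.

```python
import json
from pathlib import Path


def load_instances(path):
    """Read one JSON object per non-empty line and return them as a list."""
    instances = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                instances.append(json.loads(line))
    return instances


def count_entity_types(instances):
    """Count entities per type.

    Assumes each entity is a list ordered as
    [entity type, text, start position, end position].
    """
    counts = {}
    for instance in instances:
        for entity in instance["entities"]:
            entity_type = entity[0]
            counts[entity_type] = counts.get(entity_type, 0) + 1
    return counts
```

For example, calling `load_instances` on the `train.json` of one dataset folder and passing the result to `count_entity_types` gives a rough view of the label distribution in that split.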


## Finetuning
First, put `data.zip` under this folder and unzip it.
Then update the file directories in `pretrain.sh` and `finetune_cl.sh` to match your setup.

You can first pretrain your self-validation model by running `pretrain.sh` in this folder.
```
bash pretrain.sh 
```

You can then finetune your model by running `finetune_cl.sh` in this folder. 
```
bash finetune_cl.sh 
```