# Code To Replicate "Contrastive Entity Disambiguation for Large Scale Historical Text"

This codebase contains scripts for training and evaluating our entity disambiguation method.

This README provides an overview of the codebase and its main functionalities. For more details on the individual functions and their arguments, please refer to the docstrings in the code.


## Evaluation 
`eval_disamb_benchmarks.py` evaluates our method and reproduces the main results shown in the paper. 

The `evaluate` function returns the overall accuracy, in-Wikipedia accuracy, and not-in-Wikipedia accuracy (if applicable).
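As a rough illustration of how the three accuracies relate, here is a minimal sketch. It is not the code in `eval_disamb_benchmarks.py`: the helper name `accuracy_breakdown` and the convention of using `None` as the label for a not-in-Wikipedia entity are assumptions made for this example.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple

Label = Optional[str]  # assumption: None marks an entity that is not in Wikipedia


def _accuracy(pairs: Iterable[Tuple[Label, Label]]) -> Optional[float]:
    """Fraction of (prediction, gold) pairs that match, or None if there are no pairs."""
    pairs = list(pairs)
    if not pairs:
        return None
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def accuracy_breakdown(
    predictions: Sequence[Label], gold: Sequence[Label]
) -> Dict[str, Optional[float]]:
    """Hypothetical helper: overall, in-Wikipedia, and not-in-Wikipedia accuracy."""
    pairs = list(zip(predictions, gold))
    return {
        "overall": _accuracy(pairs),
        # Mentions whose gold entity exists in the knowledge base.
        "in_wikipedia": _accuracy((p, g) for p, g in pairs if g is not None),
        # Mentions whose gold entity is absent from the knowledge base.
        "not_in_wikipedia": _accuracy((p, g) for p, g in pairs if g is None),
    }


scores = accuracy_breakdown(
    predictions=["Q1", "Q2", "Q7", None],
    gold=["Q1", "Q5", "Q7", None],
)
```

In this sketch, a benchmark with no out-of-knowledge-base mentions yields `None` for `not_in_wikipedia`, which mirrors the "if applicable" caveat above.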


## Training 
`nlp_utils` contains wrappers around sbert and a custom sbert/biencoder implementation that can train on multiple GPUs.

Training scripts are given in `entity_disambiguation > Training`. The model consists of both a coreference step and a disambiguation step. 

The coreference model is trained in `train_pytorch_bienc_coref.py`, and the disambiguation model is trained in `train_pytorch_bienc_disamb.py`.
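To make the two-step structure concrete, the toy sketch below shows one way a coreference step and a disambiguation step could fit together at inference time, given embeddings that have already been computed. It is an illustrative stand-in, not the repository's implementation: greedy threshold clustering, the nearest-neighbour lookup with a similarity cutoff, both threshold values, and all function names are assumptions made for this example.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_mentions(mentions: List[Vector], threshold: float = 0.9) -> List[List[int]]:
    """Step 1 (coreference, assumed): greedily group mentions with similar embeddings."""
    clusters: List[List[int]] = []
    for i, vec in enumerate(mentions):
        for cluster in clusters:
            # Join the first cluster that holds a sufficiently similar mention.
            if any(cosine(vec, mentions[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def disambiguate(
    mentions: List[Vector],
    clusters: List[List[int]],
    kb: Dict[str, Vector],
    threshold: float = 0.8,
) -> List[Optional[str]]:
    """Step 2 (disambiguation, assumed): link each cluster to its nearest KB entity.

    Clusters whose best similarity falls below `threshold` get None,
    i.e. they are treated as out-of-knowledge-base.
    """
    labels: List[Optional[str]] = [None] * len(mentions)
    for cluster in clusters:
        dim = len(mentions[cluster[0]])
        centroid = [sum(mentions[i][d] for i in cluster) / len(cluster) for d in range(dim)]
        best_id, best_sim = None, threshold
        for entity_id, entity_vec in kb.items():
            sim = cosine(centroid, entity_vec)
            if sim >= best_sim:
                best_id, best_sim = entity_id, sim
        for i in cluster:
            labels[i] = best_id
    return labels


# Toy 2-d embeddings; real embeddings would come from the trained biencoders.
mentions = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [-1.0, 0.0]]
kb = {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0]}
clusters = cluster_mentions(mentions)
labels = disambiguate(mentions, clusters, kb)
```

Here the first two mentions are grouped and linked to `Q1`, the third is linked to `Q2`, and the fourth matches nothing in the toy knowledge base, so it is labelled `None`.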


## Data

We also introduce new benchmarks for entity disambiguation, which contain out-of-knowledge-base entities.

### Newspaper data

The EOTU dataset introduced in the paper is included in `data`. 


### Wikipedia data

The Wikipedia data used to train the models is not included due to size constraints, but will be released at a later date.

`entity_disambiguation > KB Wrangling` contains helper scripts to preprocess data files.

