# Supplemental Content Web KPE
We have included supplemental material that we believe will help reviewers better understand our research. It focuses on the OpenKP dataset, example results from our experiments (including BlingKPE), expert annotations, and a method for evaluating annotation quality.

## Files
1. features.tsv: 100 examples of the data.
2. predictions.tsv: predictions from various models, together with the expert annotations, for the examples in features.tsv.
3. annotations.tsv: 275 example expert annotations on 55 URLs. Each row is a unique human judgment; the keyphrases are in columns 16-18.
4. agreement.py: a simplified version of the analysis script we used to explore judge agreement and quality. It loads all the annotations, finds any that overlap, and computes the pairwise agreement on exact match (EM) and unigram overlap.

## Supplementary Data
### features.tsv
As mentioned above, features.tsv contains 100 examples from the OpenKP dataset. The file is a TSV in which each row has the following format:
Url\tPlaintext_Token\tPlaintext_plus_Raw_Visual_Feature\n
#### Plaintext_plus_Raw_Visual_Feature Format

Plaintext_plus_Raw_Visual_Feature is a JSON array of nodes: [..., {node i}, {node i+1}, ...]. An example node is shown below.
```
{"text": "Rate this", "self_visuals": [508.0, 53.0, 380.0, 20.0, 1.0, 0.0, 0.0, 12.0, 0.0], "parent_visuals": [508.0, 53.0, 380.0, 20.0, 1.0, 0.0, 0.0, 12.0, 0.0], "start_idx": 21}
```
Some notes about the structure:
1. start_idx is the index of the token in the Plaintext_Token array that the features correspond to.
2. The visual features are a list of raw, unnormalized values in the following order:
    a. position X - pixel
    b. width W - pixel
    c. position Y - pixel
    d. height H - pixel
    e. IsBlockElement
    f. IsInlineElement
    g. IsLeaf
    h. FontSize
    i. IsBold
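
As an illustration of the node layout described above, the following is a minimal sketch of how one might parse the Plaintext_plus_Raw_Visual_Feature column into named fields. The names `VisualFeatures`, `parse_visuals`, and `parse_nodes` are hypothetical and are not part of the released scripts; the field order simply follows the list above, and the sample string reuses the example node shown earlier.

```python
import json
from typing import NamedTuple


class VisualFeatures(NamedTuple):
    """Hypothetical container; field order follows the list in this README."""
    x: float
    width: float
    y: float
    height: float
    is_block: bool
    is_inline: bool
    is_leaf: bool
    font_size: float
    is_bold: bool


def parse_visuals(values):
    """Map the 9 raw, unnormalized values onto named fields."""
    if len(values) != 9:
        raise ValueError(f"expected 9 visual features, got {len(values)}")
    x, w, y, h, block, inline, leaf, font, bold = values
    return VisualFeatures(x, w, y, h, bool(block), bool(inline),
                          bool(leaf), font, bool(bold))


def parse_nodes(feature_json):
    """Return (start_idx, text, self_visuals, parent_visuals) per node."""
    return [
        (node["start_idx"], node["text"],
         parse_visuals(node["self_visuals"]),
         parse_visuals(node["parent_visuals"]))
        for node in json.loads(feature_json)
    ]


# The example node from this README, wrapped in a one-element JSON array.
example = (
    '[{"text": "Rate this", '
    '"self_visuals": [508.0, 53.0, 380.0, 20.0, 1.0, 0.0, 0.0, 12.0, 0.0], '
    '"parent_visuals": [508.0, 53.0, 380.0, 20.0, 1.0, 0.0, 0.0, 12.0, 0.0], '
    '"start_idx": 21}]'
)
start_idx, text, self_vis, parent_vis = parse_nodes(example)[0]
print(start_idx, text, self_vis.font_size, self_vis.is_block)
```

Converting the 0.0/1.0 flags to booleans is a convenience choice in this sketch; the raw values in the file are floats.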

### predictions.tsv
As mentioned above, predictions.tsv contains the expert annotations and the predictions of our experimental models for 100 examples. It can be joined with features.tsv via the unique key or URL. The file is a TSV in which each row has the following format (predictions within a column are separated by ','):
Url\tExpert Annotation\tBlingKPE Predictions\tTF_IDF Predictions\tLoToR Predictions\n
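
The join on URL can be sketched as follows. This is only an illustrative example, not part of the released scripts: the helper names (`read_tsv`, `load_predictions`, `join_on_url`) and the short column labels are hypothetical, the sketch assumes neither file has a header row, and the in-memory rows are made-up placeholders rather than real data.

```python
import io

# Hypothetical short labels for the four non-URL columns, in file order.
PREDICTION_COLUMNS = ("expert", "blingkpe", "tf_idf", "lotor")


def read_tsv(handle):
    """Yield each non-empty line split on tabs (assumes no header row)."""
    for line in handle:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t")


def load_predictions(handle):
    """Return {url: {column_label: [keyphrase, ...]}}."""
    table = {}
    for url, *cells in read_tsv(handle):
        table[url] = {
            name: [kp.strip() for kp in cell.split(",") if kp.strip()]
            for name, cell in zip(PREDICTION_COLUMNS, cells)
        }
    return table


def join_on_url(features_handle, predictions):
    """Yield (url, feature_columns, predictions) for URLs in both files."""
    for url, *feature_columns in read_tsv(features_handle):
        if url in predictions:
            yield url, feature_columns, predictions[url]


# Made-up placeholder rows; real usage would pass open("predictions.tsv")
# and open("features.tsv") instead of StringIO objects.
predictions_file = io.StringIO(
    "http://example.com/a\tfirst phrase,second phrase\tfirst phrase\tphrase\tsecond phrase\n"
)
features_file = io.StringIO("http://example.com/a\tplaceholder tokens\t[]\n")

joined = list(join_on_url(features_file, load_predictions(predictions_file)))
print(joined[0][2]["expert"])
```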

### Annotator Agreement
To understand how difficult the OpenKP task is and how high the label quality is, we used annotations.tsv and agreement.py (described above). For the analysis, we load each judgment, clean the unigrams, and turn them into a set. Next, we perform a greedy matching in which we pair off the gold and candidate keyphrases that are most similar. Similarity is either exact match (1 or 0) or unigram intersection. In annotations.tsv, column 14 is the unique URL and columns 16-18 are the keyphrases. Since judges were not required to provide all 3 keyphrases, some rows have no column 17 or 18.
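
The greedy matching described above can be sketched roughly as follows. This is not the code in agreement.py; it is a minimal illustration under several assumptions: cleaning means lowercasing and dropping punctuation, the unigram score is intersection over union (Jaccard), and the total is normalized by the longer of the two keyphrase lists. All function names are hypothetical.

```python
import re
from itertools import product


def clean_unigrams(keyphrase):
    """Assumed cleaning: lowercase, drop punctuation, return a unigram set."""
    return frozenset(re.findall(r"[a-z0-9]+", keyphrase.lower()))


def exact_match(a, b):
    """1.0 if the cleaned unigram sets are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def unigram_overlap(a, b):
    """Assumed scoring: intersection over union of the unigram sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_agreement(gold, candidate, similarity, k=3):
    """Greedily pair the most similar gold/candidate keyphrases in the top k."""
    gold_sets = [clean_unigrams(kp) for kp in gold[:k]]
    cand_sets = [clean_unigrams(kp) for kp in candidate[:k]]
    # Score every gold/candidate pair, best pairs first.
    scored = sorted(
        ((similarity(g, c), i, j)
         for (i, g), (j, c) in product(enumerate(gold_sets),
                                       enumerate(cand_sets))),
        key=lambda item: item[0],
        reverse=True,
    )
    used_gold, used_cand, total = set(), set(), 0.0
    for score, i, j in scored:
        if i in used_gold or j in used_cand:
            continue  # each keyphrase is paired at most once
        used_gold.add(i)
        used_cand.add(j)
        total += score
    # Assumption: unmatched keyphrases count as zero agreement.
    longest = max(len(gold_sets), len(cand_sets))
    return total / longest if longest else 0.0


judge_a = ["Machine Learning", "neural networks"]
judge_b = ["neural network", "machine learning!"]
em = greedy_agreement(judge_a, judge_b, exact_match)
uni = greedy_agreement(judge_a, judge_b, unigram_overlap)
print(em, uni)
```

Because keyphrases are compared as unigram sets, the exact-match score in this sketch ignores word order; a stricter variant could compare the cleaned token sequences instead.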

#### Usage
```
python3 agreement.py
```

Example output
```
Pairwise Agreement @ Top KP
Exact Match max:1.0 min:0.03571428571428571 mean:0.647400049634736
Unigram max:1.0 min:0.03571428571428571 mean:0.647400049634736
Pairwise Agreement @ Top 2 KP
Exact Match max:1.0 min:0.0 mean:0.48295454545454547
Unigram max:1.0 min:0.0 mean:0.6312499999999999
Pairwise Agreement @ Top 3 KP
Exact Match max:1.0 min:0.0 mean:0.43513257575757575
Unigram max:1.0 min:0.0 mean:0.5766466750841751
```