# PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation

This repository contains the code for our EACL 2023 paper:
- `PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation`

This repository provides our proposed evaluation of SRL quality alongside the
official evaluations from CoNLL2005 (span evaluation) and CoNLL2009 (head
evaluation). We use an SRL format that extends the CoNLL-U format (see SRL
format section below).

The following pipelines are available:
- Proposed evaluation: uses our extended CoNLL-U+SRL format.
- CoNLL2005 evaluation: converts CoNLL-U to the CoNLL2005 format for the official scripts.
- CoNLL2009 evaluation: converts CoNLL-U to the CoNLL2009 format for the official scripts.

### NOTE: The following data is included for reviewers' evaluation purposes and will not be part of the code release. It is to be used for research purposes only.

CoNLL2009 data (all in our CoNLL-U+SRL format):
- `data/conll09.wsj.test.conllu` - In-domain data with gold labels.
- `data/conll09.wsj.pred.roberta-base.conllu` - In-domain data with predicted labels from the RoBERTa model.
- `data/conll09.brown.test.conllu` - Out-of-domain data with gold labels.
- `data/conll09.brown.pred.roberta-base.conllu` - Out-of-domain data with predicted labels from the RoBERTa model.

Data sources:
- CoNLL2005: https://catalog.ldc.upenn.edu/LDC99T42 and https://www.cs.upc.edu/~srlconll/soft.html
- CoNLL2009: https://catalog.ldc.upenn.edu/LDC2012T04 and https://ufal.mff.cuni.cz/conll2009-st/eval-data.html


## Usage

- (Recommended) Create a virtual env, e.g.
    - `conda create -n eval python=3.9`
    - `conda activate eval`
- Install requirements: `pip install -r requirements.txt`
- (Optional) Verify unit tests (takes about 2 mins): `pytest tests`
- Run evaluation script:
    - `python run_evaluations.py --gold-conllu <file> --pred-conllu <file> --output-folder <folder>`
    - See the `data` and `tests/data` folders for examples.
    - Example usages:
      - `python run_evaluations.py` - uses the default CoNLL2009 out-of-domain
      dataset (in our CoNLL-U format).
      - `python run_evaluations.py -g data/conll09.wsj.test.conllu -p data/conll09.wsj.pred.roberta-base.conllu -o tmp`

The evaluation script shows the results from the official CoNLL scripts and
from our proposed evaluation method. Please see the paper for how to interpret
and compare these numbers.
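
To evaluate both bundled datasets in one go, the invocations above can be scripted. The following is a minimal sketch, not part of the repository: the helper names (`build_command`, `run_all`) and the one-sub-folder-per-dataset output layout are assumptions made for illustration. Only the flags `--gold-conllu`, `--pred-conllu`, and `--output-folder` and the data paths come from this README. By default it only prints the commands; pass `dry_run=False` to execute them from the repository root.

```python
import os
import shlex
import subprocess
import sys

# Gold/prediction pairs listed in this README (relative to the repository root).
DATASETS = {
    "wsj": ("data/conll09.wsj.test.conllu",
            "data/conll09.wsj.pred.roberta-base.conllu"),
    "brown": ("data/conll09.brown.test.conllu",
              "data/conll09.brown.pred.roberta-base.conllu"),
}


def build_command(gold, pred, output_folder):
    """Return the argument list for one run_evaluations.py invocation."""
    return [
        sys.executable, "run_evaluations.py",
        "--gold-conllu", gold,
        "--pred-conllu", pred,
        "--output-folder", output_folder,
    ]


def run_all(output_root="tmp", dry_run=True):
    """Build one command per dataset; execute them only if dry_run is False."""
    commands = []
    for name, (gold, pred) in DATASETS.items():
        # Assumed layout (hypothetical): one output sub-folder per dataset.
        output_folder = f"{output_root}/{name}"
        cmd = build_command(gold, pred, output_folder)
        commands.append(cmd)
        if not dry_run:
            # Create the folder up front in case the script expects it to exist.
            os.makedirs(output_folder, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands


# Dry run: print the commands without the interpreter path.
for cmd in run_all():
    print("python " + shlex.join(cmd[1:]))
```

Using `sys.executable` keeps the runs inside the currently active environment (for example the `eval` conda env created above).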


## Examples from our paper

We have encoded all examples in our paper as unit tests. See `tests/README.md`
for how to match up numbers in the tests with those presented in the paper.

In short, the data for the tables is in the `tests/data/<evaluation>/input`
folders, with a naming scheme similar to the examples in the tables. The
evaluation results presented in the paper are in folders with this structure:
`tests/data/<evaluation>/expected/compare-*/comparison-results-[official-conll|proposed].csv`.

We provide a script to format these comparison results. Example usages:
- `python format_results.py -c tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-official-conll.csv`
- `python format_results.py -c tests/data/sense/expected/compare-sense_test-sense_pred_p1/comparison-results-proposed.csv`
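
To format every comparison result rather than one file at a time, the folder structure described above can be walked with a glob. This is a hedged sketch under the assumption that all expected results follow the `tests/data/<evaluation>/expected/compare-*/` pattern; the helper names (`find_comparison_csvs`, `format_commands`) are hypothetical and not part of the repository. It only builds and prints the `format_results.py -c <csv>` command lines; it does not run them.

```python
import shlex
from pathlib import Path

# The two result files named in the folder structure above.
KINDS = ("official-conll", "proposed")


def find_comparison_csvs(root="tests/data"):
    """Map each compare-* folder to the result CSVs found inside it."""
    base = Path(root)
    found = {}
    if not base.is_dir():
        return found  # nothing to do outside the repository
    for kind in KINDS:
        pattern = f"*/expected/compare-*/comparison-results-{kind}.csv"
        for csv_path in base.glob(pattern):
            found.setdefault(csv_path.parent, {})[kind] = csv_path
    return found


def format_commands(root="tests/data"):
    """Return one format_results.py argument list per CSV, official first."""
    found = find_comparison_csvs(root)
    commands = []
    for folder in sorted(found):
        for kind in KINDS:
            if kind in found[folder]:
                csv_path = found[folder][kind].as_posix()
                commands.append(["python", "format_results.py", "-c", csv_path])
    return commands


# Print the command lines; prints nothing if tests/data is not present.
for cmd in format_commands():
    print(shlex.join(cmd))
```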



