Abstract
Classifiers commonly make use of pre-annotated datasets, wherein a model is evaluated by pre-defined metrics on a held-out test set typically composed of human-annotated labels. The metrics used in these evaluations are tied to the availability of well-defined ground truth labels, and they typically do not allow for inexact matches. These noisy ground truth labels and strict evaluation metrics may compromise the validity and realism of evaluation results. In the present work, we conduct a systematic label verification experiment on the entity linking (EL) task. Specifically, we ask annotators to verify the correctness of annotations after the fact (i.e., posthoc). Compared to pre-annotation evaluation, state-of-the-art EL models performed extremely well according to the posthoc evaluation methodology. Surprisingly, we find that predictions from EL models had a similar or higher verification rate than the ground truth. We conclude with a discussion of these findings and recommendations for future evaluations. The source code, raw results, and evaluation scripts are publicly available under the MIT license at https://github.com/yifding/e2e_EL_evaluate
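To make the comparison concrete, below is a minimal Python sketch of how a posthoc verification rate could be tallied per annotation source (model predictions vs. the original ground truth). The record format and field names (`source`, `verified`) are illustrative assumptions for this sketch, not the schema used in the released evaluation scripts.

```python
# Sketch: compute the fraction of annotations that posthoc annotators
# verified as correct, broken down by annotation source.
# The dict keys ("source", "verified") are hypothetical, chosen for illustration.
from collections import defaultdict

def verification_rates(judgments):
    """judgments: iterable of dicts such as
    {"source": "model_A" | "ground_truth", "verified": True | False},
    one per (mention, entity) annotation shown to an annotator.
    Returns {source: fraction verified as correct}."""
    verified = defaultdict(int)
    total = defaultdict(int)
    for j in judgments:
        total[j["source"]] += 1
        verified[j["source"]] += int(j["verified"])
    return {src: verified[src] / total[src] for src in total}

if __name__ == "__main__":
    toy = [
        {"source": "ground_truth", "verified": True},
        {"source": "ground_truth", "verified": False},
        {"source": "model_A", "verified": True},
        {"source": "model_A", "verified": True},
    ]
    print(verification_rates(toy))  # {'ground_truth': 0.5, 'model_A': 1.0}
```

Under this framing, the paper's surprising result is simply that the rate for model predictions matched or exceeded the rate for the ground-truth labels.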
- Anthology ID: 2022.dadc-1.3
- Volume: Proceedings of the First Workshop on Dynamic Adversarial Data Collection
- Month: July
- Year: 2022
- Address: Seattle, WA
- Editors: Max Bartolo, Hannah Kirk, Pedro Rodriguez, Katerina Margatina, Tristan Thrush, Robin Jia, Pontus Stenetorp, Adina Williams, Douwe Kiela
- Venue: DADC
- Publisher: Association for Computational Linguistics
- Pages: 23–29
- URL: https://aclanthology.org/2022.dadc-1.3
- DOI: 10.18653/v1/2022.dadc-1.3
- Cite (ACL): Yifan Ding, Nicholas Botzer, and Tim Weninger. 2022. Posthoc Verification and the Fallibility of the Ground Truth. In Proceedings of the First Workshop on Dynamic Adversarial Data Collection, pages 23–29, Seattle, WA. Association for Computational Linguistics.
- Cite (Informal): Posthoc Verification and the Fallibility of the Ground Truth (Ding et al., DADC 2022)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2022.dadc-1.3.pdf
- Code: yifding/e2e_EL_evaluate