@inproceedings{merdjanovska-akbik-2025-token,
  title     = {Token-Level Metrics for Detecting Incorrect Gold Annotations in {Named Entity Recognition}},
  author    = {Merdjanovska, Elena and
               Akbik, Alan},
  editor    = {Christodoulopoulos, Christos and
               Chakraborty, Tanmoy and
               Rose, Carolyn and
               Peng, Violet},
  booktitle = {Findings of the Association for Computational Linguistics: {EMNLP} 2025},
  month     = nov,
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.findings-emnlp.827/},
  doi       = {10.18653/v1/2025.findings-emnlp.827},
  pages     = {15292--15304},
  isbn      = {979-8-89176-335-7},
  abstract  = {Annotated datasets for supervised learning tasks often contain incorrect gold annotations, i.e. label noise. To address this issue, many noisy label learning approaches incorporate metrics to filter out unreliable samples, for example using heuristics such as high loss or low confidence. However, when these metrics are integrated into larger pipelines, it becomes difficult to compare their effectiveness, and understand their individual contribution to reducing label noise. This paper directly compares popular sample metrics for detecting incorrect annotations in named entity recognition (NER). NER is commonly approached as token classification, so the metrics are calculated for each training token and we flag the incorrect ones by defining metrics thresholds. We compare the metrics based on (i) their accuracy in detecting the incorrect labels and (ii) the test scores when retraining a model using the cleaned dataset. We show that training dynamics metrics work the best overall. The best metrics effectively reduce the label noise across different noise types. The errors that the model has not yet memorized are more feasible to detect, and relabeling these tokens is a more effective strategy than excluding them from training.},
}
Markdown (Informal)
[Token-Level Metrics for Detecting Incorrect Gold Annotations in Named Entity Recognition](https://aclanthology.org/2025.findings-emnlp.827/) (Merdjanovska & Akbik, Findings of EMNLP 2025)
ACL