Elena Merdjanovska
2025
Token-Level Metrics for Detecting Incorrect Gold Annotations in Named Entity Recognition
Elena Merdjanovska | Alan Akbik
Findings of the Association for Computational Linguistics: EMNLP 2025
Annotated datasets for supervised learning tasks often contain incorrect gold annotations, i.e., label noise. To address this issue, many noisy label learning approaches incorporate metrics to filter out unreliable samples, for example using heuristics such as high loss or low confidence. However, when these metrics are integrated into larger pipelines, it becomes difficult to compare their effectiveness and to understand their individual contributions to reducing label noise. This paper directly compares popular sample metrics for detecting incorrect annotations in named entity recognition (NER). NER is commonly approached as token classification, so the metrics are calculated for each training token, and we flag incorrect tokens by defining metric thresholds. We compare the metrics based on (i) their accuracy in detecting the incorrect labels and (ii) the test scores when retraining a model on the cleaned dataset. We show that training dynamics metrics work best overall and that the best metrics effectively reduce label noise across different noise types. Errors that the model has not yet memorized are easier to detect, and relabeling these tokens is a more effective strategy than excluding them from training.
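The thresholding step described in the abstract can be illustrated with a minimal sketch: a per-token score (here, the model's confidence in the gold label) is compared against a cutoff to flag likely annotation errors. The function name and threshold value are illustrative assumptions, not the paper's exact setup.

# Minimal sketch of threshold-based flagging of potentially mislabeled tokens,
# assuming per-token confidence scores for the gold label (e.g., averaged over
# training epochs). The threshold of 0.5 is illustrative only.
import numpy as np

def flag_noisy_tokens(gold_label_confidence: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Tokens whose gold-label confidence falls below the threshold are flagged
    # as candidates for relabeling or exclusion from training.
    return gold_label_confidence < threshold

# Example: confidences for five training tokens
conf = np.array([0.95, 0.12, 0.88, 0.40, 0.77])
print(flag_noisy_tokens(conf))  # [False  True False  True False]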
Measuring Label Ambiguity in Subjective Tasks using Predictive Uncertainty Estimation
Richard Alies | Elena Merdjanovska | Alan Akbik
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)
Human annotations in natural language corpora vary due to differing human perspectives, which is especially prevalent in subjective tasks. In such datasets, certain data samples are more prone to label variation and can be identified as ambiguous samples.
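As a rough illustration of predictive uncertainty estimation for this purpose, the sketch below scores each sample by the entropy of its predicted class distribution; higher entropy suggests higher label ambiguity. This is a generic estimator chosen for illustration, not necessarily the exact estimator used in the paper.

# Minimal sketch: rank samples by predictive entropy as a proxy for ambiguity.
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    # probs: shape (num_samples, num_classes) with predicted class probabilities.
    # Returns the entropy per sample; higher values indicate a flatter, more
    # uncertain predictive distribution.
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

probs = np.array([[0.98, 0.01, 0.01],   # confident prediction -> low entropy
                  [0.40, 0.35, 0.25]])  # uncertain prediction -> high entropy
print(predictive_entropy(probs))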
2024
NoiseBench: Benchmarking the Impact of Real Label Noise on Named Entity Recognition
Elena Merdjanovska | Ansar Aynetdinov | Alan Akbik
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Available training data for named entity recognition (NER) often contains a significant percentage of incorrect labels for entity types and entity boundaries. Such label noise poses challenges for supervised learning and may significantly deteriorate model quality. To address this, prior work proposed various noise-robust learning approaches capable of learning from data with partially incorrect labels. These approaches are typically evaluated using simulated noise, where the labels in a clean dataset are automatically corrupted. However, as we show in this paper, this leads to unrealistic noise that is far easier to handle than real noise caused by human error or semi-automatic annotation. To enable the study of the impact of various types of real noise, we introduce NoiseBench, an NER benchmark consisting of clean training data corrupted with 6 types of real noise, including expert errors, crowdsourcing errors, automatic annotation errors, and LLM errors. Our analysis shows that real noise is significantly more challenging than simulated noise, and that current state-of-the-art models for noise-robust learning fall far short of their achievable upper bound. We release NoiseBench for both English and German to the research community.
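For contrast with the real noise collected in NoiseBench, the sketch below shows the kind of simulated noise the abstract refers to: automatically corrupting a fraction of gold labels in a clean dataset. The noise rate, label set, and function name are illustrative assumptions, not NoiseBench's actual construction.

# Minimal sketch of simulated label noise: uniformly replace a fraction of
# gold labels with a different label from the tag set.
import random

def corrupt_labels(labels, label_set, noise_rate=0.1, seed=42):
    # Randomly swap each gold label with probability noise_rate.
    rng = random.Random(seed)
    noisy = list(labels)
    for i in range(len(noisy)):
        if rng.random() < noise_rate:
            alternatives = [l for l in label_set if l != noisy[i]]
            noisy[i] = rng.choice(alternatives)
    return noisy

gold = ["B-PER", "O", "O", "B-LOC", "O"]
print(corrupt_labels(gold, label_set=["O", "B-PER", "B-LOC", "B-ORG"]))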