Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance

Florian Babl, Moritz Hennen, Jakob Murauer, Michaela Geierhos


Abstract
In named entity recognition (NER), models are evaluated on their ability to identify entity mentions in text. However, standard evaluation methods often rely on test sets that contain named entities already present in the training data, raising concerns about overestimation of model performance. This work investigates how varying degrees of entity contamination at the dataset level affect the generalization ability and reported F1 scores of three state-of-the-art NER models. Experiments on five standard benchmarks show that contaminated entities statistically significantly inflate reported F1 scores as contamination rates increase, with F1 performance gaps of 2-10% compared to entities not seen during training. To address these inflated F1 scores, we additionally propose a novel NER dataset splitting method that uses a minimum cut algorithm to minimize train-test entity leakage. While our splitting method ensures near-zero entity contamination, we also compare new and existing dataset splits with respect to named entity sample counts.
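To make the graph-cut framing concrete, below is a minimal sketch of how a leakage-minimizing split might be set up; it is not the authors' implementation. The toy sentences, the entity sets, and the choice of networkx's Stoer-Wagner global minimum cut are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): split sentences into two
# groups so that few entity mentions are shared across the train/test boundary,
# modeled as a global minimum cut over a sentence graph.
import networkx as nx

# Toy corpus: sentence id -> set of entity surface forms it contains.
sentences = {
    "s1": {"Berlin", "Angela Merkel"},
    "s2": {"Berlin", "Siemens", "EU"},
    "s3": {"Paris", "Airbus", "EU"},
    "s4": {"Paris"},
}

# Nodes are sentences; edge weights count entities shared by a sentence pair.
G = nx.Graph()
G.add_nodes_from(sentences)
ids = list(sentences)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        shared = len(sentences[a] & sentences[b])
        if shared:
            G.add_edge(a, b, weight=shared)

# Global minimum cut (requires a connected graph): partition the sentences so
# that the total weight of shared entities crossing the split is minimal.
cut_value, (train_ids, test_ids) = nx.stoer_wagner(G)
print("entities leaking across the split:", cut_value)
print("train:", sorted(train_ids), "test:", sorted(test_ids))
```

In practice one would also need to balance split sizes and handle disconnected components; this sketch only illustrates the core idea of treating entity overlap as cut weight.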
Anthology ID:
2025.findings-acl.504
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9724–9738
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.504/
Cite (ACL):
Florian Babl, Moritz Hennen, Jakob Murauer, and Michaela Geierhos. 2025. Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance. In Findings of the Association for Computational Linguistics: ACL 2025, pages 9724–9738, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance (Babl et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.504.pdf