DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition

Qi Zhang; Huitong Pan; Zhijia Chen; Longin Jan Latecki; Cornelia Caragea; Eduard Dragut

Abstract

Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention because it scales well and can generate labeled data automatically. However, distant annotation introduces many mislabeled instances, which limits model performance. Most existing work attempts to solve this problem by developing intricate models that learn from the noisy labels. An alternative is to clean the labeled data, thereby increasing the quality of the distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing the identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
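The general idea of cleaning labels from training dynamics can be illustrated with a minimal sketch. The abstract does not specify which training-dynamics metric or which threshold estimation strategy the paper uses, so everything below is an assumption made for illustration: the metric is the mean probability the model assigned to the distant label across epochs, the threshold is a hand-picked constant rather than an automatically estimated one, and the function names and toy numbers are hypothetical.

```python
from statistics import mean


def mean_confidence(probs_per_epoch):
    """Average probability assigned to the distant label across epochs.

    This is an assumed stand-in metric, not necessarily the paper's.
    """
    return mean(probs_per_epoch)


def clean_labels(samples, threshold):
    """Split samples into kept and removed by mean confidence.

    samples: list of (sample_id, distant_label, [p_epoch1, p_epoch2, ...])
    threshold: fixed cutoff (assumed; the paper estimates it automatically)
    """
    kept, removed = [], []
    for sample_id, label, probs in samples:
        target = kept if mean_confidence(probs) >= threshold else removed
        target.append((sample_id, label))
    return kept, removed


# Toy, made-up training dynamics for three distantly labeled spans.
samples = [
    ("s1", "PER", [0.70, 0.85, 0.95]),  # consistently confident
    ("s2", "ORG", [0.20, 0.15, 0.10]),  # consistently unconfident
    ("s3", "LOC", [0.50, 0.70, 0.90]),  # learned over time
]
kept, removed = clean_labels(samples, threshold=0.5)
print("kept:", kept)
print("removed:", removed)
```

In this sketch, a span whose distant label the model never becomes confident about is treated as a likely annotation error and dropped, which mirrors the abstract's description of directly removing identified erroneous annotations before retraining.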

Anthology ID: 2025.findings-naacl.137
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 2540–2556
URL: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.137/
Cite (ACL): Qi Zhang, Huitong Pan, Zhijia Chen, Longin Jan Latecki, Cornelia Caragea, and Eduard Dragut. 2025. DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2540–2556, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition (Zhang et al., Findings 2025)
PDF: https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.137.pdf
