DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition
Qi Zhang, Huitong Pan, Zhijia Chen, Longin Jan Latecki, Cornelia Caragea, Eduard Dragut
Abstract
Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most existing work attempts to solve this problem by developing intricate models to learn from the noisy labels. An alternative approach is to clean the labeled data, thus increasing the quality of distant labels. This approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
- Anthology ID:
- 2025.findings-naacl.137
- Volume:
- Findings of the Association for Computational Linguistics: NAACL 2025
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, New Mexico
- Editors:
- Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2540–2556
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.137/
- Cite (ACL):
- Qi Zhang, Huitong Pan, Zhijia Chen, Longin Jan Latecki, Cornelia Caragea, and Eduard Dragut. 2025. DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2540–2556, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal):
- DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition (Zhang et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.137.pdf
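
The abstract describes characterizing distantly annotated samples by how a model behaves during training, then applying a threshold to locate label errors. The sketch below illustrates that general idea only. It is not the paper's method: the abstract does not specify the metric or the threshold estimator, so the choice of mean confidence across epochs, the mean-minus-one-standard-deviation cutoff, the function names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, stdev


def mean_confidence(probs_per_epoch):
    """Average probability the model gave to the distant label across epochs.

    Assumption: this is one common training-dynamics statistic, chosen here
    for illustration; the paper may use a different one.
    """
    return mean(probs_per_epoch)


def flag_suspect_labels(dynamics, num_std=1.0):
    """Flag samples whose mean confidence is unusually low.

    Assumption: the cutoff (mean - num_std * sample std of the scores) is a
    placeholder heuristic, not the paper's automatic threshold estimation.
    Returns the sorted list of flagged ids and the cutoff that was used.
    """
    scores = {sid: mean_confidence(p) for sid, p in dynamics.items()}
    values = list(scores.values())
    threshold = mean(values) - num_std * stdev(values)
    suspects = sorted(sid for sid, s in scores.items() if s < threshold)
    return suspects, threshold


# Hypothetical per-epoch probabilities of the distant label for four spans.
dynamics = {
    "span_a": [0.70, 0.85, 0.95],
    "span_b": [0.60, 0.80, 0.90],
    "span_c": [0.65, 0.90, 0.97],
    "span_d": [0.30, 0.20, 0.10],  # confidence keeps dropping: likely mislabeled
}

suspects, threshold = flag_suspect_labels(dynamics)
print(suspects)
```

In this toy data only `span_d` falls below the cutoff. Following the abstract, flagged annotations would then be removed from the distant labels before training the final model.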