Self-Cleaning: Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances

Zhendong Chu; Ruiyi Zhang; Tong Yu; Rajiv Jain; Vlad Morariu; Jiuxiang Gu; Ani Nenkova

doi:10.18653/v1/2024.findings-naacl.14

Abstract

To achieve state-of-the-art performance, one still needs to train NER models on large-scale, high-quality annotated data, an asset that is both costly and time-intensive to accumulate. In contrast, real-world applications often resort to massive low-quality labeled data, obtained from non-expert annotators via crowdsourcing or from external knowledge bases via distant supervision, as a cost-effective alternative. However, these annotation methods produce noisy labels, which in turn lead to a notable decline in performance. Hence, we propose to denoise the noisy NER data with guidance from a small set of clean instances. Along with the main NER model, we train a discriminator model and use its outputs to recalibrate the sample weights. The discriminator is capable of detecting both span and category errors with different discriminative prompts. Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
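
The general idea of recalibrating sample weights from discriminator outputs can be illustrated with a weighted cross-entropy loss. The following is a minimal sketch, not the paper's actual formulation: the function name `weighted_cross_entropy`, the toy probabilities, and the discriminator scores are all hypothetical and chosen only for illustration. It assumes the discriminator emits a trust score in [0, 1] for each noisy label.

```python
import math


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood over samples.

    probs:   per-sample predicted class distributions (each sums to 1)
    labels:  per-sample (possibly noisy) class indices
    weights: per-sample trust scores in [0, 1]; assumed here to come
             from a discriminator (hypothetical values in this sketch)
    """
    total_weight = sum(weights)
    if total_weight == 0:
        # No trusted samples: nothing contributes to the loss.
        return 0.0
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Clamp the probability to avoid log(0).
        loss += -w * math.log(max(p[y], 1e-12))
    return loss / total_weight


# Toy data: three tokens, two classes. The third label is assumed to be noisy,
# and the hypothetical discriminator assigns it a low trust score.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
noisy_labels = [0, 1, 1]
discriminator_scores = [0.95, 0.90, 0.10]

uniform = weighted_cross_entropy(probs, noisy_labels, [1.0, 1.0, 1.0])
reweighted = weighted_cross_entropy(probs, noisy_labels, discriminator_scores)
```

In this toy setting the down-weighted noisy label contributes less to the training signal, so `reweighted` is smaller than `uniform`. How the paper's discriminator is prompted and how its outputs map to weights is described in the paper itself, not in this sketch.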

Anthology ID: 2024.findings-naacl.14
Volume: Findings of the Association for Computational Linguistics: NAACL 2024
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 196–210
URL: https://aclanthology.org/2024.findings-naacl.14
DOI: 10.18653/v1/2024.findings-naacl.14
Cite (ACL): Zhendong Chu, Ruiyi Zhang, Tong Yu, Rajiv Jain, Vlad Morariu, Jiuxiang Gu, and Ani Nenkova. 2024. Self-Cleaning: Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 196–210, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Self-Cleaning: Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances (Chu et al., Findings 2024)
PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.findings-naacl.14.pdf
