Identifying Noise in Human-Created Datasets using Training Dynamics from Generative Models
Maeda Hanafi, Ishan Jindal, Yannis Katsis, Lucian Popa, Huaiyu Zhu
Abstract
Instruction fine-tuning enhances the alignment of autoregressive language models (ArLMs) with human intent but relies on large-scale annotated datasets prone to label and text noise. In this paper, we show that existing noise detection techniques designed for autoencoder models (AeLMs) do not directly generalize to ArLMs due to differences in learning dynamics. We propose TDRanker, a novel approach leveraging training dynamics to rank datapoints from easy-to-learn to hard-to-learn, effectively identifying noisy instances. Our method demonstrates robustness across multiple model architectures covering both autoencoder and autoregressive language models (GPT-2, BERT, LaMini-Cerebras-256M) and across various dataset noise levels, achieving at least 2x faster denoising than previous techniques. Applied to real-world classification and generative tasks, TDRanker significantly improves data quality and model performance. These findings suggest that TDRanker provides a scalable solution for refining instruction-tuning datasets, enhancing the reliability of fine-tuned ArLMs in practical applications.
- Anthology ID: 2025.findings-emnlp.840
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 15534–15550
- URL: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.840/
- DOI: 10.18653/v1/2025.findings-emnlp.840
- Cite (ACL): Maeda Hanafi, Ishan Jindal, Yannis Katsis, Lucian Popa, and Huaiyu Zhu. 2025. Identifying Noise in Human-Created Datasets using Training Dynamics from Generative Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15534–15550, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): Identifying Noise in Human-Created Datasets using Training Dynamics from Generative Models (Hanafi et al., Findings 2025)
- PDF: https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.840.pdf