Priority on High-Quality: Selecting Instruction Data via Consistency Verification of Noise Injection

Hong Zhang, Feng Zhao, Ruilin Zhao, Cheng Yan, Kangzheng Liu


Abstract
Large Language Models (LLMs) have demonstrated a remarkable understanding of language nuances through instruction tuning, enabling them to effectively tackle various natural language processing tasks. Recent research has focused on the quality of instruction data rather than its quantity. However, existing high-quality instruction selection methods rely on external models or rules, overlooking the intrinsic association between the pre-trained model and the instruction data, which makes it difficult to select data that align with the pre-trained model's preferences. To address this challenge, we propose a strategy that uses noise injection to identify high-quality instruction data without relying on any external model. We also combine inter-class and intra-class diversity to further improve model performance. Experimental results demonstrate that our method significantly outperforms the model trained on the entire dataset as well as established baselines. Our study offers a new perspective on noise injection for instruction tuning and shows that the pre-trained model itself should be considered when defining what constitutes high-quality data. Additionally, we release our selected high-quality instruction data.
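The abstract describes the core idea only at a high level, so the following is a minimal, hypothetical sketch of noise-injection consistency scoring, not the authors' implementation. It assumes Gaussian noise added to the input embeddings and the gap between clean and noisy language-modeling losses as the consistency signal; the model name `gpt2`, the noise scale `sigma`, and the helper `consistency_score` are illustrative placeholders rather than names from the paper.

```python
# Hypothetical sketch of noise-injection consistency scoring (not the authors' code).
# Assumption: an instruction example is "consistent" if the pre-trained model's loss
# changes little when Gaussian noise is injected into its input embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def consistency_score(model, tokenizer, instruction, response, sigma=0.01, device="cpu"):
    """Return the absolute gap between clean and noisy losses for one example."""
    text = instruction + "\n" + response
    enc = tokenizer(text, return_tensors="pt").to(device)
    labels = enc["input_ids"].clone()

    with torch.no_grad():
        # Loss on the unperturbed input.
        clean_loss = model(**enc, labels=labels).loss

        # Inject Gaussian noise into the token embeddings (an assumed noise scheme).
        embeds = model.get_input_embeddings()(enc["input_ids"])
        noisy_embeds = embeds + sigma * torch.randn_like(embeds)
        noisy_loss = model(
            inputs_embeds=noisy_embeds,
            attention_mask=enc["attention_mask"],
            labels=labels,
        ).loss

    # Smaller gap = prediction is more stable under perturbation.
    return (noisy_loss - clean_loss).abs().item()

if __name__ == "__main__":
    name = "gpt2"  # placeholder pre-trained model
    tok = AutoTokenizer.from_pretrained(name)
    lm = AutoModelForCausalLM.from_pretrained(name).eval()
    gap = consistency_score(lm, tok, "Explain photosynthesis.", "Plants convert light into energy.")
    print(f"consistency gap: {gap:.4f}")
```

Under these assumptions, examples whose loss changes least under perturbation would be treated as better aligned with the pre-trained model's preferences; the paper's actual scoring and diversity-based selection steps are given in the PDF linked below.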
Anthology ID:
2025.emnlp-main.1048
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
20775–20787
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1048/
Cite (ACL):
Hong Zhang, Feng Zhao, Ruilin Zhao, Cheng Yan, and Kangzheng Liu. 2025. Priority on High-Quality: Selecting Instruction Data via Consistency Verification of Noise Injection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20775–20787, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Priority on High-Quality: Selecting Instruction Data via Consistency Verification of Noise Injection (Zhang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1048.pdf
Checklist:
 2025.emnlp-main.1048.checklist.pdf