FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only
He Zhu, Yifan Ding, Yicheng Tao, Zhiwen Ruan, Yixia Li, Wenjia Zhang, Yun Chen, Guanhua Chen
Abstract
Instruction tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotation or costly proprietary LLMs. Recent works explore approaches to synthesize data with open-sourced LLMs but require high-quality human-crafted seed data. In this work, we introduce FANNO, an end-to-end framework that synthesizes high-quality instruction data with open-sourced LLMs and sampled unlabeled documents, eliminating the need for seed data. Starting from diverse pre-screened documents, the framework synthesizes complex and diverse high-quality instruction–response pairs in successive stages. We propose a tagging-based prompt method to generate diverse and complex seed data, and a UCB-based approach to augment more instruction data from the seed data. A novel Think Different prompt is proposed to address the distributional limitations of the seeds, further boosting data diversity. Experiments show that FANNO can generate diverse, complex, high-quality data even with a small open-sourced teacher model. The synthesized instruction data achieves performance comparable to, or even surpassing, that of baseline annotation methods using proprietary or open-sourced LLMs, while requiring fewer instruction data samples.
- Anthology ID:
- 2025.findings-acl.906
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2025
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venues:
- Findings | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17633–17653
- URL:
- https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.906/
- Cite (ACL):
- He Zhu, Yifan Ding, Yicheng Tao, Zhiwen Ruan, Yixia Li, Wenjia Zhang, Yun Chen, and Guanhua Chen. 2025. FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17633–17653, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only (Zhu et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/ingestion-acl-25/2025.findings-acl.906.pdf
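The UCB-based augmentation mentioned in the abstract can be illustrated with the generic UCB1 selection rule, which balances reusing seeds that have produced good data (exploitation) against trying under-sampled seeds (exploration). This is a minimal sketch of UCB1 in general, not the paper's implementation: the function name, reward bookkeeping, and exploration constant `c` are assumptions.

```python
import math

def ucb_select(rewards, counts, c=1.4):
    """Return the index with the highest UCB1 score.

    rewards: cumulative reward observed for each seed
    counts:  number of times each seed has been selected (each >= 1)
    c:       exploration constant; larger values favor less-tried seeds
    """
    total = sum(counts)
    scores = [
        r / n + c * math.sqrt(math.log(total) / n)  # mean reward + exploration bonus
        for r, n in zip(rewards, counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, a seed with a lower average reward but far fewer selections can still win the exploration bonus and be picked next, which is how UCB keeps the augmented pool from collapsing onto a few high-scoring seeds.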