From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding

Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao


Abstract
The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora.
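The two processes named in the abstract can be pictured as a small pipeline. The following is a minimal sketch only, not the authors' implementation: the prompt wording, the `llm` callable, and the names `attribute`, `synthesize`, `build_dataset`, and `SyntheticInstruction` are all hypothetical. It also assumes that the situations attributed to real instructions are reused as demonstrations during synthesis, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical: any text-in, text-out model call supplied by the caller.
LLM = Callable[[str], str]


@dataclass
class SyntheticInstruction:
    document: str     # web document used as the grounding source
    situation: str    # user situation generated from the document
    instruction: str  # instruction written for that situation


def attribute(real_instruction: str, llm: LLM) -> str:
    """Top-down step: ground a real instruction to a situated user."""
    prompt = (
        "Describe the user and situation behind this instruction:\n"
        f"{real_instruction}"
    )
    return llm(prompt)


def synthesize(document: str, demos: List[str], llm: LLM) -> SyntheticInstruction:
    """Bottom-up step: document -> situation -> instruction."""
    # Assumption: attributed situations serve as demonstrations here.
    demo_text = "\n".join(demos)
    situation = llm(
        f"Example situations:\n{demo_text}\n\n"
        f"Document:\n{document}\n\n"
        "Write a new situation grounded in the document."
    )
    instruction = llm(
        f"Situation:\n{situation}\n\n"
        "Write the instruction this user would give."
    )
    return SyntheticInstruction(document, situation, instruction)


def build_dataset(
    real_instructions: Iterable[str], documents: Iterable[str], llm: LLM
) -> List[SyntheticInstruction]:
    """Run attribution once, then synthesize one instruction per document."""
    demos = [attribute(r, llm) for r in real_instructions]
    return [synthesize(d, demos, llm) for d in documents]
```

In this sketch the number of synthesized instructions grows with the number of web documents supplied, which mirrors the abstract's point that the approach scales with more web corpora.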
Anthology ID:
2025.acl-long.517
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
10516–10543
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.517/
Cite (ACL):
Chiwei Zhu, Benfeng Xu, Xiaorui Wang, and Zhendong Mao. 2025. From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10516–10543, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding (Zhu et al., ACL 2025)
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.517.pdf