Xijun Gu

2026

High-quality, diverse data are vital for large language models (LLMs) but remain scarce and costly. Data synthesis is a viable alternative and succeeds on closed tasks, yet the humanities and social sciences (HSS) are overlooked, and their open-ended nature makes synthesis challenging. Moving beyond prior capability-centric, fragmented attempts, we adopt a subject-centric paradigm, define the first HSS domain system covering 14 mainstream fields, and introduce HSS-Synth, the first data synthesis pipeline for HSS. HSS-Synth comprises: (1) constructing seed documents from web corpora via multi-step filtering and text refinement evaluated by a judge; (2) specifying “requirements + persona” to backtranslate seed documents into diverse yet faithful instructions with a strict Q&A alignment check; and (3) breaking LLM response limits via teacher-forced answering, which feeds seed documents to the model during response generation to anchor semantics, reduce hallucinations, and preserve tone and integrity. HSS-Synth yields 237k high-quality, diverse instruction-tuning samples that outperform 14 leading baselines on 16 benchmarks. The fine-tuned Qwen3-8B-Base sets a new SOTA and approaches the official Qwen3-8B, improving both human preference and knowledge capability without performance seesaws. Extensive experiments demonstrate HSS-Synth’s robustness and transferability. Our code is publicly available at https://github.com/pengr/HSS-Synth.

Co-authors

Kexin Yang (1)

Venues

Findings (1)
