Jinyang Zhang

2026

High-quality, diverse data are vital for large language models (LLMs) but remain scarce and costly. Data synthesis is a viable alternative and succeeds on closed tasks, yet the humanities and social sciences (HSS) are overlooked, and their open-ended nature makes synthesis challenging.Moving beyond prior capability-centric, fragmented attempts, we adopt a subject-centric paradigm, define the first HSS domain system covering 14 mainstream fields, and introduce HSS-Synth—the first data synthesis pipeline for HSS.HSS-Synth comprises: (1) constructing seed document from web corpora via multi-step filtering and text refinement evaluated by a judge; (2) specifying “requirements + persona” to backtranslate seed document into diverse yet faithful instructions with strict Q&A alignment check; and (3) breaking LLM response limits via teacher-forced Answering that fed seed documents during response to anchor semantics, reduce hallucinations, and preserve tone and integrity.HSS-Synth yields 237k high-quality, diverse instruction-tuning samples that outperform 14 leading baselines on 16 benchmarks. The fine-tuned Qwen3-8B-Base set new SOTA and approached official Qwen3-8B, improving both human preference and knowledge capability without performance seesaws. Extensive experiments demonstrate the HSS-Synth’s robustness and transferability.Our code is publicly available at https://github.com/pengr/HSS-Synth.

pdf bib abs

Interactive medical questioning is essential in clinical consultations, where physicians must actively gather necessary patient information. Yet existing medical Large Language Models (LLMs) predominantly follow a reactive paradigm, risking diagnostic errors by answering before seeking sufficient details. To bridge this gap, we propose ProMed, a reinforcement learning framework that transitions LLMs toward a proactive paradigm, enabling them to ask clinically valuable questions before decision-making. Central to ProMed is the Shapley Information Gain (SIG) reward, which quantifies a question’s clinical utility as the amount of newly acquired information, while considering its contextual importance via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search to construct high-reward interaction trajectories for supervision, and (2) SIG-Augmented Policy Optimization, with a novel SIG-guided Reward Distribution Mechanism that prioritizes informative questions for fine-grained optimization. Experiments on partial-information medical benchmarks show that ProMed significantly outperforms state-of-the-art methods by 6.29% on average and delivers a 54.45% gain over the reactive paradigm, and generalizes robustly to out-of-domain cases. Our codes are available at https://github.com/hxxding/ProMed.

2025

pdf bib abs

The multilingual capabilities of large language models (LLMs) have attracted considerable attention over the past decade. Assessing the accuracy with which LLMs provide answers in multilingual contexts is essential for determining their level of multilingual proficiency. Nevertheless, existing multilingual benchmarks generally reveal severe drawbacks, such as overly translated content (translationese), the absence of difficulty control, constrained diversity, and disciplinary imbalance, making the benchmarking process unreliable and showing low convincingness. To alleviate those shortcomings, we introduce NOVA-63 (Native Omni-lingual Versatile Assessments of 63 Disciplines), a comprehensive, difficult multilingual benchmark featuring 93,536 questions sourced from native speakers across 14 languages and 63 academic disciplines. Leveraging a robust pipeline that integrates LLM-assisted formatting, expert quality verification, and multi-level difficulty screening, NOVA-63 is balanced on disciplines with consistent difficulty standards while maintaining authentic linguistic elements. Extensive experimentation with current LLMs has shown significant insights into cross-lingual consistency among language families, and exposed notable disparities in models’ capabilities across various disciplines. This work provides valuable benchmarking data for the future development of multilingual models. Furthermore, our findings underscore the importance of moving beyond overall scores and instead conducting fine-grained analyses of model performance.

pdf bib abs

Large Language Models (LLMs) excel in general language tasks, motivating their adaptation to specialized domains such as healthcare. Effective domain adaptation typically involves supervised fine-tuning (SFT) on carefully selected instruction-tuning data. Current data selection methods adopt a data-centric approach, relying on external annotations and heuristics to identify externally defined high-quality or challenging data. Our exploratory experiments highlight this approach fails to improve the model’s domain performance, due to misalignment between selected data and the model’s knowledge distribution. To tackle this, we propose Decomposed Difficulty-based Data Selection (3DS), a two-stage model-centric data selection framework that aligns data selection with the model’s distribution. 3DS employs Prompt-Driven Data Selection to filter out noise based on the model’s knowledge via explicit alignment in Stage#1, then adopts Decomposed Difficulty-based Data Selection to guide selection via three novel data difficulty metrics, including Instruction Understanding, Response Confidence, and Response Correctness in Stage#2, enhanced by an attention-based importance weighting mechanism for accurate calibration.Extensive experiments in the healthcare domain show 3DS outperforms existing methods by up to 2.97% accuracy, with additional validation in law and general domains, confirming its generalization ability. Our dataset and code are open-sourced at https://github.com/PuppyKnightUniversity/3DS.