Yao Wang
2026
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Chao Xue | Yao Wang | Mengqiao Liu | Di Liang | Xingsheng Han | Peiyang Liu | Xianjie Wu | Chenyao Lu | Lei Jiang | Yu Lu | Haibo Shi | Shuang Liang | Minlong Peng | Flora D. Salim
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon (ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Chao Xue | Yao Wang | Mengqiao Liu | Di Liang | Xingsheng Han | Peiyang Liu | Xianjie Wu | Chenyao Lu | Lei Jiang | Yu Lu | Haibo Shi | Shuang Liang | Minlong Peng | Flora D. Salim
Findings of the Association for Computational Linguistics: ACL 2026
Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.
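The selective-triggering idea in this abstract can be illustrated with a toy gate. This is a minimal sketch, not E-GRM itself: it assumes that "convergence of parallel generations" can be approximated by the share of sampled answers that agree with the most common one. The function name `needs_cot` and the `threshold` value are hypothetical.

```python
from collections import Counter

def needs_cot(answers, threshold=0.8):
    """Decide whether to trigger chain-of-thought for one input.

    `answers` holds the final answers from several parallel direct
    (non-CoT) generations. Low agreement is treated as high uncertainty.
    Both the agreement proxy and the threshold are assumptions.
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    # Trigger the slower CoT path only when the samples disagree.
    return agreement < threshold, top_answer, agreement

# Unanimous samples keep the fast path; split samples trigger CoT.
print(needs_cot(["A", "A", "A", "A"]))
print(needs_cot(["A", "B", "A", "C"]))
```

A gate of this shape needs no handcrafted or task-specific features, which matches the property the abstract emphasizes; the actual uncertainty estimator in the paper may differ.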
Half-S: Halving the Scale for Near-Lossless 4-Bit LLM Training
Jinyang Du | Ruihao Gong | Linghan Ai | Zining Wang | Yunke Peng | Yao Wang | Lei Yan | Wxuefei | Yaoyuan Wang | Jinyang Guo | Dahua Lin | Xianglong Liu
Findings of the Association for Computational Linguistics: ACL 2026
Training large language models (LLMs) at 4-bit precision offers substantial efficiency gains but remains challenging due to the limited dynamic range and coarse numerical resolution. Existing 4-bit training pipelines typically rely on max-scaling, which is ill-suited for heavy-tailed LLM tensor distributions and leads to severe under-utilization of the FP4 quantization grid in the low-magnitude region. This effect causes pronounced representation collapse and large rounding errors for the values that dominate LLM computation. In this work, we derive the theoretically optimal scaling for FP4 under heavy-tailed inputs, revealing why max-scaling is intrinsically suboptimal. Guided by this analysis, we propose Half-S, a simple and efficient scaling strategy that uses half-scaling as a hardware-friendly default and falls back to an MSE-based clipping threshold when needed, yielding a close approximation to the theoretical optimum under real LLM statistics. Extensive experiments on large-scale pretraining and downstream fine-tuning show that Half-S consistently narrows the gap to BF16 in both convergence and final model quality, while preserving the efficiency benefits of 4-bit computation. Under native FP4 support, Half-S is estimated to provide up to 1.8× end-to-end training speedup. These results indicate that Half-S provides a simple and effective correction to max-scaling, substantially improving the stability and accuracy of 4-bit LLM training.
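The under-utilization of the low-magnitude FP4 grid under max-scaling can be seen on a toy tensor. The sketch below is illustrative only: it assumes the standard E2M1 magnitude grid for FP4, and it reads "half-scaling" as halving the max-derived scale, which is one plausible interpretation rather than the paper's stated rule. The helper names are hypothetical.

```python
import math

# Non-negative magnitudes representable in FP4 (E2M1); the sign is a separate bit.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def max_scale(values):
    """Max-scaling: map the largest magnitude onto the top of the grid."""
    return max(abs(v) for v in values) / FP4_GRID[-1]

def quantize_fp4(values, scale):
    """Fake-quantize: divide by scale, snap to the nearest grid point, rescale."""
    out = []
    for v in values:
        target = abs(v) / scale
        # Magnitudes beyond the grid saturate at its largest entry (6.0).
        q = min(FP4_GRID, key=lambda g: abs(g - target))
        out.append(math.copysign(q * scale, v))
    return out

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

x = [0.1, 0.2, 0.3, 0.4, 12.0]   # mostly small values plus one outlier
s_max = max_scale(x)             # 2.0
s_half = s_max / 2               # assumed reading of "half-scaling"

for name, s in [("max", s_max), ("half", s_half)]:
    xq = quantize_fp4(x, s)
    print(name, xq, mse(x, xq))
```

With max-scaling, every small entry collapses to zero. With the halved scale, some small entries survive but the outlier is clipped, and on this particular tensor the clipping error dominates the MSE. That trade-off is the kind of case where an MSE-based fallback, as mentioned in the abstract, would matter.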
2025
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance
Yao Wang | Di Liang | Minlong Peng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
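Two ingredients named in this abstract, selecting core parameters by update magnitude and merging the rest with SLERP, can be sketched on flat parameter lists. This is a minimal illustration, not the CPI-FT implementation: the top-fraction selection rule and the names `core_mask` and `slerp` are assumptions made for the example.

```python
import math

def core_mask(base, tuned, fraction):
    """Mark the top `fraction` of parameters by |tuned - base| as core.

    Selecting a fixed top fraction is an assumption; the abstract only
    says core regions are found from update magnitudes.
    """
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    k = max(1, int(len(deltas) * fraction))
    top = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)[:k]
    mask = [False] * len(deltas)
    for i in top:
        mask[i] = True
    return mask

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two parameter vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    linear = [(1 - t) * x + t * y for x, y in zip(a, b)]
    if norm_a < eps or norm_b < eps:
        return linear  # the angle is undefined for a zero vector
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return linear  # (anti)parallel vectors: fall back to linear interpolation
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]

mask = core_mask([0.0, 0.0, 0.0, 0.0], [0.1, -2.0, 0.3, 1.0], fraction=0.5)
midpoint = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
print(mask, midpoint)
```

Unlike plain averaging, SLERP follows the arc between the two vectors, so the midpoint of two orthogonal unit vectors keeps unit norm instead of shrinking. A real merge would apply this per tensor, and only to the non-core entries.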
2020
Find or Classify? Dual Strategy for Slot-Value Predictions on Multi-Domain Dialog State Tracking
Jianguo Zhang | Kazuma Hashimoto | Chien-Sheng Wu | Yao Wang | Philip Yu | Richard Socher | Caiming Xiong
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics
Dialog state tracking (DST) is a core component in task-oriented dialog systems. Existing approaches for DST mainly fall into one of two categories, namely, ontology-based and ontology-free methods. An ontology-based method selects a value from a candidate-value list for each target slot, while an ontology-free method extracts spans from dialog contexts. Recent work introduced a BERT-based model to strike a balance between the two methods by pre-defining categorical and non-categorical slots. However, it is not clear enough which slots are better handled by either of the two slot types, and the way to use the pre-trained model has not been well investigated. In this paper, we propose a simple yet effective dual-strategy model for DST, by adapting a single BERT-style reading comprehension model to jointly handle both the categorical and non-categorical slots. Our experiments on the MultiWOZ datasets show that our method significantly outperforms the BERT-based counterpart, finding that the key is a deep interaction between the domain-slot and context information. When evaluated on noisy (MultiWOZ 2.0) and cleaner (MultiWOZ 2.1) settings, our method performs competitively and robustly across the two different settings. Our method sets the new state of the art in the noisy setting, while performing more robustly than the best model in the cleaner setting. We also conduct a comprehensive error analysis on the dataset, including the effects of the dual strategy for each slot, to facilitate future research.
Co-authors
- Di Liang 3
- Minlong Peng 3
- Xingsheng Han 2
- Lei Jiang 2
- Shuang Liang 2
- Mengqiao Liu 2
- Peiyang Liu 2
- Chenyao Lu 2
- Yu Lu 2
- Flora D. Salim 2
- Haibo Shi 2
- Xianjie Wu 2
- Chao Xue 2
- Linghan Ai 1
- Jinyang Du 1
- Ruihao Gong 1
- Jinyang Guo 1
- Kazuma Hashimoto 1
- Dahua Lin 1
- Xianglong Liu 1
- Yunke Peng 1
- Richard Socher 1
- Zining Wang 1
- Yaoyuan Wang 1
- Chien-Sheng Wu 1
- Wxuefei 1
- Caiming Xiong 1
- Lei Yan 1
- Philip S. Yu 1
- Jianguo Zhang 1