Haibo Shi
2026
Why Supervised Fine-Tuning Fails to Learn: A Systematic Study of Incomplete Learning in Large Language Models
Chao Xue | Yao Wang | Mengqiao Liu | Di Liang | Xingsheng Han | Peiyang Liu | Xianjie Wu | Chenyao Lu | Lei Jiang | Yu Lu | Haibo Shi | Shuang Liang | Minlong Peng | Flora D. Salim
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Supervised Fine-Tuning (SFT) is the standard approach for adapting large language models (LLMs) to downstream tasks. However, we observe a persistent failure mode: even after convergence, models often fail to correctly reproduce a subset of their own supervised training data. We refer to this behavior as the Incomplete Learning Phenomenon (ILP). This paper presents the first systematic study of ILP in LLM fine-tuning. We formalize ILP as post-training failure to internalize supervised instances and demonstrate its prevalence across multiple model families, domains, and datasets. Through controlled analyses, we identify five recurrent sources of incomplete learning: (1) missing prerequisite knowledge in the pre-trained model, (2) conflicts between SFT supervision and pre-training knowledge, (3) internal inconsistencies within SFT data, (4) left-side forgetting during sequential fine-tuning, and (5) insufficient optimization for rare or complex patterns. We introduce a diagnostic-first framework that maps unlearned samples to these causes using observable training and inference signals, and study several targeted mitigation strategies as causal interventions. Experiments on Qwen, LLaMA, and OLMo2 show that incomplete learning is widespread and heterogeneous, and that improvements in aggregate metrics can mask persistent unlearned subsets. The findings highlight the need for fine-grained diagnosis of what supervised fine-tuning fails to learn, and why.
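The notion of an unlearned subset described in this abstract can be illustrated with a small sketch. It is not the authors' diagnostic framework: the exact-match criterion after whitespace stripping, the `generate` callable, and the helper names `unlearned_subset` and `unlearned_rate` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (prompt, supervised target)


def unlearned_subset(
    examples: Iterable[Example],
    generate: Callable[[str], str],
) -> List[Example]:
    """Return the training pairs the model fails to reproduce.

    Assumption: "reproduce" means exact match after stripping whitespace.
    The abstract does not state the matching criterion.
    """
    return [
        (prompt, target)
        for prompt, target in examples
        if generate(prompt).strip() != target.strip()
    ]


def unlearned_rate(
    examples: Iterable[Example],
    generate: Callable[[str], str],
) -> float:
    """Fraction of training pairs that remain unlearned (0.0 if empty)."""
    examples = list(examples)
    if not examples:
        return 0.0
    return len(unlearned_subset(examples, generate)) / len(examples)


# Hypothetical stand-in for a fine-tuned model: a lookup table of outputs.
_toy_outputs = {"2+2=": "4", "capital of France?": "Lyon", "3*3=": " 9 "}


def toy_generate(prompt: str) -> str:
    return _toy_outputs.get(prompt, "")


train = [("2+2=", "4"), ("capital of France?", "Paris"), ("3*3=", "9")]
missed = unlearned_subset(train, toy_generate)
```

Reporting the per-sample list alongside the rate matters here because, as the abstract notes, aggregate metrics can improve while a fixed subset stays unlearned.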
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
Chao Xue | Yao Wang | Mengqiao Liu | Di Liang | Xingsheng Han | Peiyang Liu | Xianjie Wu | Chenyao Lu | Lei Jiang | Yu Lu | Haibo Shi | Shuang Liang | Minlong Peng | Flora D. Salim
Findings of the Association for Computational Linguistics: ACL 2026
Recent advancements in the Generative Reward Model (GRM) have demonstrated its potential to enhance the reasoning abilities of LLMs through Chain-of-Thought (CoT) prompting. Despite these gains, existing implementations of GRM suffer from two critical limitations. First, CoT prompting is applied indiscriminately to all inputs regardless of their inherent complexity. This introduces unnecessary computational costs for tasks amenable to fast, direct inference. Second, existing approaches primarily rely on voting-based mechanisms to evaluate CoT outputs, which often lack granularity and precision in assessing reasoning quality. In this paper, we propose E-GRM, an efficient generative reward modeling framework grounded in model-internal uncertainty. E-GRM leverages the convergence behavior of parallel model generations to estimate uncertainty and selectively trigger CoT reasoning only when needed, without relying on handcrafted features or task-dependent signals. To improve reward fidelity, we introduce a lightweight discriminative scorer trained with a hybrid regression–ranking objective to provide fine-grained evaluation of reasoning paths. Experiments on multiple reasoning benchmarks show that E-GRM substantially reduces inference cost while consistently improving answer accuracy, demonstrating that model-internal uncertainty is an effective and general signal for efficient reasoning-aware reward modeling.
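The selective triggering idea in this abstract can be sketched as an agreement check over parallel samples. This is a minimal illustration, not E-GRM itself: the abstract does not say how convergence is measured, so the majority-share statistic, the `threshold` value, and the function names `agreement` and `needs_cot` are assumptions.

```python
from collections import Counter
from typing import Sequence


def agreement(answers: Sequence[str]) -> float:
    """Share of parallel samples that agree with the most common answer.

    Assumption: answers are compared after lowercasing and stripping.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def needs_cot(answers: Sequence[str], threshold: float = 0.75) -> bool:
    """Trigger chain-of-thought only when the samples fail to converge.

    `threshold` is a hypothetical tuning knob, not a value from the paper.
    """
    return agreement(answers) < threshold


# Four direct (non-CoT) samples per input, as a toy example.
confident = ["A", "a ", "A", "B"]   # 3 of 4 agree
uncertain = ["A", "B", "C", "A"]    # 2 of 4 agree
```

With these inputs, `needs_cot(confident)` is false and `needs_cot(uncertain)` is true, so only the second input would pay for CoT reasoning.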
2025
FinEval-KR: A Financial Domain Evaluation Framework for Large Language Models’ Knowledge and Reasoning
Shaoyu Dou | Yutian Shen | Mofan Chen | Zixuan Wang | Jiajie Xu | Qi Guo | Kailai Shao | Chao Chen | Haixiang Hu | Haibo Shi | Min Min | Liwen Zhang
Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing
2024
GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation
Wenjie Zhou | Zhenxin Ding | Xiaodong Zhang | Haibo Shi | Junfeng Wang | Dawei Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Pre-trained language models have become an integral component of question-answering systems, achieving remarkable performance. However, for practical deployment, it is crucial to perform knowledge distillation to maintain high performance while operating under computational constraints. In this paper, we address a key question: given the importance of unsupervised distillation for student model performance, how can knowledge from multiple teacher models be effectively ensembled during this stage without the guidance of labels? We propose a novel algorithm, GOVERN, to tackle this issue. GOVERN has demonstrated significant improvements in both offline and online experiments, enabling the student model to achieve results comparable to those of teacher ensembles. Our experiments show that GOVERN requires a mere 1% of the ensemble method’s inference budget to achieve 99.5% of its performance. The proposed algorithm has been successfully deployed in a real-world commercial question-answering system, demonstrating its real-world applicability.
Learning to Use Tools via Cooperative and Interactive Agents
Zhengliang Shi | Shen Gao | Xiuyi Chen | Yue Feng | Lingyong Yan | Haibo Shi | Dawei Yin | Pengjie Ren | Suzan Verberne | Zhaochun Ren
Findings of the Association for Computational Linguistics: EMNLP 2024
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. Existing methods employ one single LLM-based agent to iteratively select and execute tools, thereafter incorporating execution results into the next action prediction. Despite their progress, these methods suffer from performance degradation when addressing practical tasks due to: (1) the pre-defined pipeline with restricted flexibility to calibrate incorrect actions, and (2) the struggle to adapt a general LLM-based agent to perform a variety of specialized actions. To mitigate these problems, we propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. ConAgents introduces two communication protocols to enable the flexible cooperation of agents. To effectively generalize the ConAgents into open-source models, we also propose specialized action distillation, enhancing their ability to perform specialized actions in our framework. Our extensive experiments on three datasets show that the LLMs, when equipped with the ConAgents, outperform baselines with substantial improvement (i.e., up to 14% higher success rate).
ATM: Adversarial Tuning Multi-agent System Makes a Robust Retrieval-Augmented Generator
Junda Zhu | Lingyong Yan | Haibo Shi | Dawei Yin | Lei Sha
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have been shown to benefit substantially from retrieval-augmented generation (RAG) in alleviating hallucinations when confronted with knowledge-intensive questions. RAG adopts information retrieval techniques to inject external knowledge from semantically relevant documents as input contexts. However, because today’s Internet is flooded with noisy and fabricated content, RAG systems are inevitably vulnerable to this noise and prone to respond incorrectly. To this end, we propose to optimize the retrieval-augmented Generator with an Adversarial Tuning Multi-agent system (ATM). The ATM steers the Generator to have a robust perspective of useful documents for question answering with the help of an auxiliary Attacker agent. The Generator and the Attacker are tuned adversarially for several iterations. After rounds of multi-agent iterative tuning, the Generator can eventually better discriminate useful documents amongst fabrications. The experimental results verify the effectiveness of ATM, and we also observe that the Generator can achieve better performance compared to state-of-the-art baselines.
KnowTuning: Knowledge-aware Fine-tuning for Large Language Models
Yougang Lyu | Lingyong Yan | Shuaiqiang Wang | Haibo Shi | Dawei Yin | Pengjie Ren | Zhumin Chen | Maarten de Rijke | Zhaochun Ren
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Despite their success at many natural language processing (NLP) tasks, large language models still struggle to effectively leverage knowledge for knowledge-intensive tasks, manifesting limitations such as generating incomplete, non-factual, or illogical answers. These limitations stem from inadequate knowledge awareness of LLMs during vanilla fine-tuning. To address these problems, we propose a knowledge-aware fine-tuning (KnowTuning) method to improve fine-grained and coarse-grained knowledge awareness of LLMs. We devise a fine-grained knowledge augmentation stage to train LLMs to identify difficult fine-grained knowledge in answers. We also propose a coarse-grained knowledge comparison stage to train LLMs to distinguish between reliable and unreliable knowledge, in three aspects: completeness, factuality, and logicality. Extensive experiments on both generic and medical question answering (QA) datasets confirm the effectiveness of KnowTuning, through automatic and human evaluations, across various sizes of LLMs. We further verify that KnowTuning generates more facts with a lower factual error rate under fine-grained facts evaluation.
Co-authors
- Dawei Yin 4
- Lingyong Yan 3
- Xingsheng Han 2
- Lei Jiang 2
- Di Liang 2
- Shuang Liang 2
- Mengqiao Liu 2
- Peiyang Liu 2
- Chenyao Lu 2
- Yu Lu 2
- Minlong Peng 2
- Pengjie Ren 2
- Zhaochun Ren 2
- Flora D. Salim 2
- Yao Wang 2
- Xianjie Wu 2
- Chao Xue 2
- Mofan Chen 1
- Chao Chen 1
- Xiuyi Chen 1
- Zhumin Chen 1
- Zhenxin Ding 1
- Shaoyu Dou 1
- Yue Feng 1
- Shen Gao 1
- Qi Guo 1
- Haixiang Hu 1
- Yougang Lyu 1
- Min Min 1
- Lei Sha 1
- Kailai Shao 1
- Yutian Shen 1
- Zhengliang Shi 1
- Suzan Verberne 1
- Zixuan Wang 1
- Junfeng Wang 1
- Shuaiqiang Wang 1
- Jiajie Xu 1
- Liwen Zhang 1
- Xiaodong Zhang 1
- Wenjie Zhou 1
- Junda Zhu 1
- Maarten de Rijke 1