Fei Yu
2026
Don’t Tell the Answer, Truly Guide the Reasoning During RL Rollouts
Xinyi Wang | Jinyi Han | Zishang Jiang | Tingyun li | Jiaqing Liang | Sihang Jiang | Zhaoqian Dai | Ma Shuguang | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) has emerged as a key approach for improving long chain-of-thought (CoT) reasoning in large language models (LLMs). However, existing methods such as GRPO often break down when task difficulty exceeds the model’s capacity, resulting in sparse rewards and inefficient training. While prior work attempts to address this issue using off-policy data, it frequently introduces distributional mismatch, leading to unstable policy updates. In this work, we identify a fundamental issue underlying these limitations, which we term *low training affinity*, and propose **Affinity**, the first quantitative metric for measuring the compatibility between external guidance and a model’s intrinsic policy. Based on this insight, we introduce **HINT**, an adaptive framework designed to enhance reasoning performance while explicitly preserving high Affinity. HINT consists of two key components. First, instead of providing partial answers, it introduces **Meta-Hints**, which serve as abstract cognitive scaffolding that guides the model to independently construct solutions. Second, we propose **Affinity-Aware Policy Optimization (AAPO)**, which dynamically adjusts the learning objective based on the Affinity signal to ensure stable training. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while achieving improved training stability and robust generalization to out-of-distribution tasks. Code is available at: https://github.com/ViviqwerAsd/HINT
ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models
Tingyun li | Zishang Jiang | Jinyi Han | Xinyi Wang | Sihang Jiang | Han Xia | Zhaoqian Dai | Ma Shuguang | Fei Yu | Jiaqing Liang | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency–performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency–performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.
Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
Qingyu Ren | Qianyu He | Powei Chang | Jie Zeng | Zeye Sun | Fei Yu | Jiaqing Liang | Yanghua Xiao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. We will open-source our code and data to facilitate future research.
Are Large Language Models Reliable Reviewers? A Benchmark for Error Detection in Financial Documents
Ying He | Zhouhong Gu | Zhecheng Hu | Yubo Zhou | Hao Shen | Jiaqing Liang | Zhaoqian Dai | Ma Shuguang | Fei Yu | Yanghua Xiao | Zhixu Li
Findings of the Association for Computational Linguistics: ACL 2026
Ensuring the accuracy of financial documents is critical for economic analysis, regulatory compliance, and corporate decision-making. Several studies have shown that Large Language Models (LLMs) perform well in many financial tasks, such as stock price movement prediction and financial analytics. However, a critical task remains unexplored: the ability of LLMs to identify errors in financial documents. In this paper, we introduce **FinED-Bench**, the first publicly available benchmark for financial error detection across three levels of cognitive complexity. FinED-Bench covers nine real-world financial scenarios and includes over 900 documents reported in 2025 that are unseen by existing language models. We detail the benchmark construction process and evaluate several advanced LLMs (e.g., GPT-4o, Qwen3-14B) on this task, which requires both financial domain knowledge and reasoning capabilities. Experimental results show that current LLMs still struggle with this task, especially in high-complexity cases. In addition, supervised fine-tuning can significantly improve the performance of weaker LLMs on this task. Our data and code are available at https://anonymous.4open.science/r/FinED-Bench-406F.
Toward Automated Robustness Evaluation of Mathematical Reasoning
Yutao Hou | Zeguan Xiao | Fei Yu | Yihan Jiang | Ma Shuguang | Zhaoqian Dai | Hailiang Huang | Yun Chen | Guanhua Chen
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework’s extensibility to non-mathematical tasks, highlighting its broad applicability. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model’s robustness.
2025
PAFT: Prompt-Agnostic Fine-Tuning
Chenxing Wei | Mingwen Ou | Ying He | Yao Shu | Fei Yu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Fine-tuning large language models (LLMs) often causes overfitting to specific prompt wording, where minor phrasing variations drastically reduce performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a method that enhances robustness through dynamic prompt variation during training. PAFT first generates diverse synthetic prompts, then continuously samples from this set to construct training instances, forcing models to learn fundamental task principles rather than surface-level patterns. Across systematic evaluations using both supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT), PAFT consistently demonstrates improved performance on benchmarks for question answering, mathematical reasoning, and tool use. It achieves 7% higher generalization accuracy on unseen prompts than standard methods with similar training efficiency. Notably, models trained with PAFT attain 3.2× faster inference speeds due to reduced prompt sensitivity. Ablation studies further validate the effectiveness of PAFT, while theoretical analysis reveals that PAFT can effectively enhance the cross-domain generalization ability of LLMs.
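The sampling stage described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt pool below is a hand-written placeholder standing in for the synthetic prompts that PAFT generates, and the names `PROMPT_POOL` and `build_training_batch` are hypothetical.

```python
import random

# Placeholder pool; in PAFT these prompts would be generated synthetically.
PROMPT_POOL = [
    "Question: {question}\nAnswer:",
    "Please answer the following.\n{question}",
    "{question}\nGive your answer below.",
]


def build_training_batch(examples, prompt_pool, rng):
    """Wrap each (question, answer) pair in a prompt drawn at random from the pool.

    Calling this again for the next batch or epoch redraws the prompts, so the
    same example is seen under different wordings over the course of training.
    """
    batch = []
    for question, answer in examples:
        template = rng.choice(prompt_pool)
        batch.append({
            "input": template.format(question=question),
            "target": answer,
        })
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    data = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
    for item in build_training_batch(data, PROMPT_POOL, rng):
        print(repr(item["input"]), "->", item["target"])
```

Only the prompt wording varies between draws; the question and target stay fixed, which is the property the abstract relies on to discourage overfitting to a single phrasing.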
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion
Jianqing Zhu | Huang Huang | Zhihang Lin | Juhao Liang | Zhengyang Tang | Khalid Almubarak | Mosen Alharthi | Bang An | Juncai He | Xiangbo Wu | Fei Yu | Junying Chen | Ma Zhuoheng | Yuhao Du | He Zhang | Saied Alshahrani | Emad A. Alghamdi | Lian Zhang | Ruoyu Sun | Haizhou Li | Benyou Wang | Jinchao Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper addresses the critical need for democratizing large language models (LLMs) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or GPT-3.5, due to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for Arabic LLMs is to utilize Arabic-specific vocabulary in the tokenizer to accelerate decoding. However, using a different vocabulary often leads to degradation of the model’s learned knowledge, since many words become out-of-vocabulary (OOV) at the beginning of training. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion, which is implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. The ablation study demonstrated the effectiveness of Progressive Vocabulary Expansion. Moreover, AraLLaMA achieves decent performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Our model weights are available at: https://github.com/FreedomIntelligence/AraLLaMa.
Flexora: Flexible Low-Rank Adaptation for Large Language Models
Chenxing Wei | Yao Shu | Ying Tiffany He | Fei Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have revolutionized artificial intelligence, but their performance on specific tasks is often limited by knowledge boundaries. While fine-tuning techniques like low-rank adaptation (LoRA) aim to address this, they can suffer from overfitting. We propose flexible low-rank adaptation (Flexora), a novel method that automatically selects the most critical layers for fine-tuning to optimize performance across diverse downstream tasks. Flexora formulates layer selection as a hyperparameter optimization problem, employs unrolled differentiation for efficient solving, and identifies the most impactful layers based on optimized hyperparameters. Extensive experiments across various pre-trained models and natural language tasks demonstrate that Flexora consistently outperforms existing baselines. We provide theoretical insights and comprehensive ablation studies to elucidate the effectiveness of Flexora. Therefore, Flexora offers a robust solution to enhance LoRA fine-tuning for LLMs, potentially advancing the field of adaptive language model optimization.
Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following
Jie Zeng | Qianyu He | Qingyu Ren | Jiaqing Liang | Weikang Zhou | Zeye Sun | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2025
Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). We observe that LLMs exhibit dramatic performance fluctuations when the order of the incorporated constraints is disturbed. Yet, no existing work has systematically investigated this position bias problem in the field of multi-constraint instruction following. To bridge this gap, we design a probing task where we quantitatively measure the difficulty distribution of the constraints by a novel Constraint Difficulty Distribution Index (CDDI). Through the experimental results, we find that LLMs perform better when presented with the constraints in a “hard-to-easy” order. This preference generalizes to LLMs with different architectures or different parameter sizes. Additionally, we conduct an explanation study, providing an intuitive insight into the correlation between the LLM’s attention and constraint orders. Our code and dataset are publicly available at https://github.com/meowpass/PBIF.
Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation
Sirui Xia | Xintao Wang | Jiaqing Liang | Yifei Zhang | Weikang Zhou | Jiaji Deng | Fei Yu | Yanghua Xiao
Findings of the Association for Computational Linguistics: NAACL 2025
Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) in knowledge-intensive tasks. To enhance credibility and verifiability in RAG systems, Attributed Text Generation (ATG) has been proposed, which provides citations to retrieved knowledge in LLM-generated responses. Prior methods mainly adopt coarse-grained attributions, with passage-level or paragraph-level references or citations, which fall short in verifiability. This paper proposes ReClaim (Refer & Claim), a fine-grained ATG method that alternates the generation of references and answers step by step. Unlike previous coarse-grained attribution, ReClaim provides sentence-level citations in long-form question-answering tasks. Through extensive experiments, we verify the effectiveness of ReClaim across a wide range of settings, achieving a citation accuracy rate of 90%.
Order Doesn’t Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation
Qianxi He | Qianyu He | Jiaqing Liang | Weikang Zhou | Zeye Sun | Fei Yu | Yanghua Xiao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations. LLMs often rely on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs’ reasoning performance and adaptability to diverse logical structures. We release our code and augmented data at https://anonymous.4open.science/r/Order-Centric-Data-Augmentation-822C.
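The two augmentations named in this abstract (shuffling independent premises, and reordering steps along a dependency DAG) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's code: the dependency edges are assumed to be given as `(before, after)` index pairs, a randomized Kahn-style topological sort stands in for whatever procedure the authors use to pick valid reorderings, and the function names are hypothetical.

```python
import random


def random_topological_order(num_steps, deps, rng):
    """Return a random ordering of steps 0..num_steps-1 that respects deps.

    deps is a list of (before, after) pairs: step `before` must come first.
    """
    indegree = [0] * num_steps
    children = [[] for _ in range(num_steps)]
    for before, after in deps:
        children[before].append(after)
        indegree[after] += 1

    ready = [s for s in range(num_steps) if indegree[s] == 0]
    order = []
    while ready:
        # Kahn's algorithm, choosing at random among the steps that are ready.
        step = ready.pop(rng.randrange(len(ready)))
        order.append(step)
        for child in children[step]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != num_steps:
        raise ValueError("dependency graph contains a cycle")
    return order


def respects(order, deps):
    """Check that every dependency edge points forward in the given order."""
    position = {step: i for i, step in enumerate(order)}
    return all(position[b] < position[a] for b, a in deps)


def augment(premises, steps, deps, seed=0):
    """Shuffle the premises and reorder the steps without breaking dependencies."""
    rng = random.Random(seed)
    shuffled_premises = list(premises)
    rng.shuffle(shuffled_premises)
    order = random_topological_order(len(steps), deps, rng)
    return shuffled_premises, [steps[i] for i in order], order


if __name__ == "__main__":
    premises = ["A implies B", "C implies D", "B and D imply E"]
    steps = ["derive B", "derive D", "derive E"]
    deps = [(0, 2), (1, 2)]  # E needs both B and D; B and D are independent
    print(augment(premises, steps, deps, seed=1))
```

In the example, the first two steps may swap places but the third always comes last, which is the sense in which a reordering is valid.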
Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models
Qingyu Ren | Jie Zeng | Qianyu He | Jiaqing Liang | Yanghua Xiao | Weikang Zhou | Zeye Sun | Fei Yu
Findings of the Association for Computational Linguistics: ACL 2025
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. In real-world scenarios, user instructions often contain soft constraints, which are semantically related and cannot be rule-based verified, posing challenges for LLMs. To enhance the soft constraint following ability of LLMs, we initially design a pipeline to construct datasets with high-quality outputs for instructions containing soft constraints automatically. Additionally, to fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. Furthermore, taking into account the difficulty of soft constraints indicated by the number of constraints, we design a curriculum learning training paradigm based on the constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs’ soft constraint following ability and analyze the factors driving the improvements.
2024
Teaching Small Language Models Reasoning through Counterfactual Distillation
Tao Feng | Yicheng Li | Chenglin Li | Hao Chen | Fei Yu | Yin Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With the rise of large language models (LLMs), many studies are interested in transferring the reasoning capabilities of LLMs to small language models (SLMs). Previous distillation methods usually utilize the capabilities of LLMs to generate chain-of-thought (CoT) samples and teach SLMs via fine-tuning. However, such a standard distillation approach performs poorly when applied to out-of-distribution (OOD) examples, and the diversity of the generated CoT samples is insufficient. In this work, we propose a novel counterfactual distillation framework. Firstly, we leverage LLMs to automatically generate high-quality counterfactual data. Given an input text example, our method generates a counterfactual example that is very similar to the original input, but its task label has been changed to the desired one. Then, we utilize multi-view CoT to enhance the diversity of reasoning samples. Experiments on four NLP benchmarks show that our approach enhances the reasoning capabilities of SLMs and is more robust to OOD data. We also conduct extensive ablations and sample studies to understand the reasoning capabilities of SLMs.
AceGPT, Localizing Large Language Models in Arabic
Huang Huang | Fei Yu | Jianqing Zhu | Xuening Sun | Hao Cheng | Song Dingjie | Zhihong Chen | Mosen Alharthi | Bang An | Juncai He | Ziche Liu | Junying Chen | Jianquan Li | Benyou Wang | Lian Zhang | Ruoyu Sun | Xiang Wan | Haizhou Li | Jinchao Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions and GPT-4 responses in Arabic, and Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed ‘AceGPT’, sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. Code, data, and models are available at https://github.com/FreedomIntelligence/AceGPT.
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning
Fei Yu | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2024
Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search
Chenglin Li | Qianglong Chen | Zhi Li | Feng Tao | Yicheng Li | Hao Chen | Fei Yu | Yin Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024
Instruction tuning is a crucial technique for aligning language models with humans’ actual goals in the real world. Extensive research has highlighted that the quality of instruction data is essential for the success of this alignment. However, creating high-quality data manually is labor-intensive and time-consuming, which leads researchers to explore using LLMs to synthesize data. Recent studies have focused on using a stronger LLM to iteratively enhance existing instruction data, showing promising results. Nevertheless, previous work often lacks control over the evolution direction, resulting in high uncertainty in the data synthesis process and low-quality instructions. In this paper, we introduce IDEA-MCTS (Instruction Data Enhancement using Monte Carlo Tree Search), a general and scalable framework for efficiently synthesizing instructions. With tree search and evaluation models, it can efficiently guide each instruction to evolve into a high-quality form, aiding in instruction fine-tuning. Experimental results show that IDEA-MCTS significantly enhances the seed instruction data, raising the average evaluation scores of quality, diversity, and complexity from 2.19 to 3.81. Furthermore, in open-domain benchmarks, experimental results show that IDEA-MCTS improves the accuracy of real-world instruction-following skills in LLMs by an average of 5% in low-resource settings.
2023
HuatuoGPT, Towards Taming Language Model to Be a Doctor
Hongbo Zhang | Junying Chen | Feng Jiang | Fei Yu | Zhihong Chen | Guiming Chen | Jianquan Li | Xiangbo Wu | Zhang Zhiyi | Qingying Xiao | Xiang Wan | Benyou Wang | Haizhou Li
Findings of the Association for Computational Linguistics: EMNLP 2023
In this paper, we present HuatuoGPT, a Large Language Model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both distilled data from **ChatGPT** and real-world data from **doctors** in the supervised fine-tuning stage. This is not only because purely using **ChatGPT**-distilled data might cause ‘model collapse’, but also because real-world data from **doctors** would be complementary to **ChatGPT**-distilled data. The responses from ChatGPT are usually detailed, well-presented, fluent, and instruction-following, but ChatGPT cannot perform like a doctor in many aspects, e.g., in interactive diagnosis. Therefore, the extra doctors’ data could tame a distilled language model to perform like doctors. To synergize the strengths of both data sources, we introduce RLMF (Reinforcement Learning from Mixed Feedback), where a reward model is trained to align the language model with the merits that both sources (ChatGPT and doctors) bring. Experimental results (in GPT-4 evaluation, human evaluation, and medical benchmark datasets) demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs. It is worth noting that by using additional real-world data and RLMF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model (i.e., ChatGPT) in most cases.
Co-authors
- Jiaqing Liang 8
- Yanghua Xiao 8
- Zhaoqian Dai 4
- Qianyu He 4
- Ma Shuguang 4
- Zeye Sun 4
- Benyou Wang 4
- Weikang Zhou 4
- Junying Chen 3
- Haizhou Li 3
- Qingyu Ren 3
- Jie Zeng 3
- Mosen Alharthi 2
- Bang An 2
- Hao Chen 2
- Zhihong Chen 2
- Jinyi Han 2
- Ying He 2
- Juncai He 2
- Huang Huang 2
- Zishang Jiang 2
- Sihang Jiang 2
- Yicheng Li 2
- Chenglin Li 2
- Jianquan Li 2
- Yao Shu 2
- Ruoyu Sun 2
- Xiang Wan 2
- Xinyi Wang 2
- Chenxing Wei 2
- Xiangbo Wu 2
- Jinchao Xu 2
- Lian Zhang 2
- Yin Zhang 2
- Jianqing Zhu 2
- Tingyun li 2
- Emad A. Alghamdi 1
- Khalid Almubarak 1
- Saied Alshahrani 1
- Powei Chang 1
- Yun Chen 1
- Guanhua Chen 1
- Guiming Chen 1
- Qianglong Chen 1
- Hao Cheng 1
- Jiaji Deng 1
- Song Dingjie 1
- Yuhao Du 1
- Tao Feng 1
- Anningzhe Gao 1
- Zhouhong Gu 1
- Ying Tiffany He 1
- Qianxi He 1
- Yutao Hou 1
- Zhecheng Hu 1
- Hailiang Huang 1
- Yihan Jiang 1
- Feng Jiang (蒋峰) 1
- Zhixu Li 1
- Zhi Li 1
- Juhao Liang 1
- Zhihang Lin 1
- Ziche Liu 1
- Mingwen Ou 1
- Hao Shen 1
- Xuening Sun 1
- Zhengyang Tang 1
- Feng Tao 1
- Xintao Wang 1
- Sirui Xia 1
- Han Xia 1
- Zeguan Xiao 1
- Qingying Xiao 1
- He Zhang 1
- Yifei Zhang 1
- Hongbo Zhang 1
- Zhang Zhiyi 1
- Yubo Zhou 1
- Ma Zhuoheng 1