Yang Gao
2026
MDTeamGPT: Mitigating Context Collapse and Enabling Self-Evolution in Medical Multi-Agent Reasoning
Kai Chen | Xinfeng Li | Tianpei Yang | Hewei Wang | Guang Yang | Jing Huo | Yang Gao
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have shown great potential in multi-disciplinary team (MDT) medical consultations. However, long, multi-round, multi-role interaction trajectories inevitably lead to severe information dilution and context window overload, triggering context collapse, which destabilizes reasoning. Furthermore, prior systems typically rely on unstructured trajectory history storage without structurally distilling key information or reflecting on errors, severely limiting continuous learning capabilities. We propose MDTeamGPT, a context-resilient and self-evolving multi-agent framework. Mechanistically, we introduce a specialized Lead Physician mechanism combined with a Residual Context architecture to compress and reorganize multi-round consensus, effectively mitigating context overload and reducing computational costs. For memory, we design a Dual Knowledge Base system comprising a CorrectKB for verified trajectories and a ChainKB for reflective error analysis, enabling self-evolution via retrieval from both successes and failures. We evaluate our framework on standard text datasets (MedQA, PubMedQA), multimodal benchmarks (VQA-RAD, SLAKE), and a collected set of more complex clinical problems. Experimental results show that MDTeamGPT substantially outperforms existing baselines across both text-based and multimodal tasks, while also demonstrating superior diagnostic performance and stability in complex clinical scenarios.
Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
Siyuan Gan | Jiaheng Liu | Boyan Wang | Tianpei Yang | Runqing Miao | Yuyao Zhang | Fanyu Meng | Junlan Feng | Linjian Meng | Jing Huo | Yang Gao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, using RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields only limited mitigation. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and it sets a different maximum token budget for non-thinking responses to each query by leveraging information from the solution component of the responses that use thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the best trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking in TNT’s responses that are classified as not using thinking remains below 10% across all tested datasets.