Zhaoxiang Liu
2026
Mixture of Heterogeneous Grouped Experts for Language Modeling
Zhicheng Ma | Xiang Liu | Zhaoxiang Liu | Ning Wang | Yi Shen | Kai Wang | Shuming Shi | Shiguo Lian
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that fails to align computational costs with varying token-level complexity. While heterogeneous expert architectures attempt to address this by diversifying expert sizes, they often suffer from significant system-level challenges, specifically unbalanced GPU utilization and inefficient parameter utilization, which hinder practical deployment. To bridge the gap between theoretical heterogeneity and robust industrial application, we propose Mixture of Heterogeneous Grouped Experts (MoHGE), which introduces a two-level routing mechanism to enable flexible, resource-aware expert combinations. To optimize inference efficiency, we propose a Group-Wise Auxiliary Loss, which dynamically steers tokens to the most parameter-efficient expert groups based on task difficulty. To address the critical deployment challenge of GPU load balancing, we introduce an All-size Group-decoupling Allocation strategy coupled with an Intra-Group Experts Auxiliary Loss. These mechanisms collectively ensure uniform computation distribution across GPUs. Extensive evaluations demonstrate that MoHGE matches the performance of MoE architectures while reducing the total parameters by approximately 20% and maintaining balanced GPU utilization. Our work establishes a scalable paradigm for resource-efficient MoE design, offering a practical solution for optimizing inference costs in real-world scenarios.
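The two-level routing idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the top-k selection rule, and the way group and expert probabilities are multiplied and renormalized are all assumptions made for illustration. It only shows the general shape of "choose expert groups first, then choose experts inside each chosen group".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Indices of the k largest values, highest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def two_level_route(group_logits, expert_logits, k_groups=1, k_experts=2):
    """Hypothetical two-level router for a single token.

    group_logits:  one score per expert group (level 1).
    expert_logits: one list of scores per group (level 2); groups may
                   differ in size, mirroring heterogeneous groups.
    Returns (group, expert, weight) triples whose weights sum to 1.
    """
    group_probs = softmax(group_logits)
    routes = []
    for g in top_k(group_probs, k_groups):
        expert_probs = softmax(expert_logits[g])
        for e in top_k(expert_probs, k_experts):
            # Assumed combination rule: product of the two gate values.
            routes.append((g, e, group_probs[g] * expert_probs[e]))
    total = sum(w for _, _, w in routes)
    return [(g, e, w / total) for g, e, w in routes]


routes = two_level_route(
    group_logits=[0.1, 2.0, -1.0],
    expert_logits=[[0.0, 0.0], [1.5, 0.2, 0.9], [0.3, 0.3]],
)
```

In this toy call the token is sent to the highest-scoring group and then to the two highest-scoring experts inside it. The auxiliary losses and the GPU allocation strategy named in the abstract are outside the scope of the sketch.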
2025
Fuzzy Reasoning Chain (FRC): An Innovative Reasoning Framework from Fuzziness to Clarity
Ping Chen | Xiang Liu | Zhaoxiang Liu | Zezhou Chen | Xingpeng Zhang | Huan Hu | Zipeng Wang | Kai Wang | Shuming Shi | Shiguo Lian
Findings of the Association for Computational Linguistics: EMNLP 2025
With the rapid advancement of large language models (LLMs), natural language processing (NLP) has achieved remarkable progress. Nonetheless, significant challenges remain in handling texts with ambiguity, polysemy, or uncertainty. We introduce the Fuzzy Reasoning Chain (FRC) framework, which integrates LLM semantic priors with continuous fuzzy membership degrees, creating an explicit interaction between probability-based reasoning and fuzzy membership reasoning. This transition allows ambiguous inputs to be gradually transformed into clear and interpretable decisions while capturing conflicting or uncertain signals that traditional probability-based methods cannot. We validate FRC on sentiment analysis tasks, where both theoretical analysis and empirical results show that it ensures stable reasoning and facilitates knowledge transfer across different model scales. These findings indicate that FRC provides a general mechanism for managing subtle and ambiguous expressions with improved interpretability and robustness.
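As a rough illustration of what "continuous fuzzy membership degrees" can mean for sentiment, the sketch below maps a polarity score in [-1, 1] to degrees of membership in three fuzzy sets. The triangular membership shape, the set boundaries, and the function names are assumptions chosen for illustration; the abstract does not specify how FRC defines its memberships or combines them with LLM priors.

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def sentiment_memberships(polarity):
    """Hypothetical fuzzy sets over a polarity score in [-1, 1]."""
    return {
        "negative": triangular(polarity, -2.0, -1.0, 0.0),
        "neutral": triangular(polarity, -1.0, 0.0, 1.0),
        "positive": triangular(polarity, 0.0, 1.0, 2.0),
    }


# A mildly positive input belongs partly to "neutral" and partly to "positive".
degrees = sentiment_memberships(0.4)
label = max(degrees, key=degrees.get)
```

Unlike a single hard label, the degrees keep the partial overlap visible: the input is mostly neutral but also somewhat positive, and a downstream step can decide how to resolve that.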
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
Yi Shen | Jian Zhang | Jieyun Huang | Shuming Shi | Wenjing Zhang | Jiangze Yan | Ning Wang | Kai Wang | Zhaoxiang Liu | Shiguo Lian
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Recent advancements in slow-thinking reasoning models have shown exceptional performance in complex reasoning tasks. However, their tendency toward “overthinking” on simple problems leads to excessive computational resource usage and increased inference latency, which hinders their widespread industrial adoption. While current mitigation strategies uniformly reduce reasoning tokens, they risk degrading performance on challenging tasks that require extended reasoning. This paper introduces Difficulty-Adaptive Slow-Thinking (DAST), a novel framework that enables models to autonomously adjust Chain-of-Thought (CoT) length based on problem difficulty. We propose a Token Length Budget (TLB) metric and leverage budget-aware preference optimization to implement DAST, which penalizes inefficiency on simple problems while incentivizing deep reasoning for complex ones. Experiments demonstrate DAST’s significant value for practical application: it effectively mitigates overthinking, substantially lowering costs and latency while crucially preserving high accuracy on complex problems, paving the way for the efficient application of advanced reasoning models.
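One plausible way to make a difficulty-dependent length budget concrete is sketched below. The interpolation formula, the use of sampled accuracy as a difficulty proxy, and the function names are assumptions for illustration; the abstract does not state how the Token Length Budget is computed. The idea shown is that problems the model usually solves get a budget near the typical length of a correct answer, while problems it rarely solves get a budget near the maximum length.

```python
def token_length_budget(correct_lengths, num_samples, max_length):
    """Hypothetical length budget for one problem.

    correct_lengths: token counts of the sampled responses that were correct.
    num_samples:     total number of sampled responses for the problem.
    max_length:      maximum allowed response length in tokens.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if not correct_lengths:
        # Never solved in sampling: allow the full length.
        return float(max_length)
    accuracy = len(correct_lengths) / num_samples  # assumed difficulty proxy
    mean_correct = sum(correct_lengths) / len(correct_lengths)
    # Interpolate between typical correct length and the maximum length.
    return accuracy * mean_correct + (1.0 - accuracy) * max_length


def over_budget_ratio(length, budget):
    """Fraction by which a response exceeds its budget (0 if within budget)."""
    return max(0.0, (length - budget) / budget)


easy_budget = token_length_budget([200, 300, 250, 250], num_samples=4, max_length=4000)
hard_budget = token_length_budget([3000], num_samples=4, max_length=4000)
```

A preference-optimization step could then prefer, among correct responses, those with a lower over-budget ratio. That pairing rule is likewise an assumption and is not shown here.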