Guorui Zhou
2026
Why Can Distillation Work with Limited Resources? A Systematic Study
Xiao Hu | Xingyu Lu | Liyuan Mao | YiFan Zhang | Tianke Zhang | Bin Wen | Fan Yang | Tingting Gao | Guorui Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Recently, large language models have made remarkable progress in reasoning, largely driven by scaling data and model size. In parallel, several studies argue that for smaller models, high-quality distillation can yield strong reasoning performance with minimal resources. However, a framework for understanding machine reasoning that explains why low-resource distillation can boost model performance is still missing. In this paper, we conduct a controlled case study: using fewer than 920 examples, a simple distillation based on the base model can achieve notable reasoning performance improvement, compared with the base model and even the zero-RL models. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the base and zero-RL models. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors give rise to flexible reasoning, which is essential for solving reasoning problems.
Unleashing the Native Recommendation Potential: LLM-Based Generative Recommendation via Structured Term Identifiers
Zhiyang Zhang | Junda She | Kuo Cai | Bo Chen | Shiyao Wang | Xinchen Luo | Qiang Luo | Ruiming Tang | Han Li | Kun Gai | Guorui Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Leveraging the vast open-world knowledge and understanding capabilities of Large Language Models (LLMs) to develop general-purpose, semantically-aware recommender systems has emerged as a pivotal research direction in generative recommendation. However, existing methods face bottlenecks in constructing item identifiers. Text-based methods introduce LLMs’ vast output space, leading to hallucination, while methods based on Semantic IDs (SIDs) encounter a semantic gap between SIDs and LLMs’ native vocabulary, requiring costly vocabulary expansion and alignment training. To address this, this paper introduces Term IDs (TIDs), defined as a set of semantically rich and standardized textual keywords, to serve as robust item identifiers. We propose GRAM, a novel framework centered on TIDs, which employs Context-aware Term Generation to convert item metadata into standardized TIDs and utilizes Integrative Instruction Fine-tuning to collaboratively optimize term internalization and sequential recommendation. Additionally, Elastic Identifier Grounding is designed for robust item mapping. Extensive experiments on real-world datasets demonstrate that GRAM significantly outperforms baselines across multiple scenarios, pointing to a promising direction for generalizable and high-performance generative recommendation systems.
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Zhenpeng Su | Leiyu Pan | Minxuan Lv | Yuntao Li | Wenping Hu | Fuzheng Zhang | Kun Gai | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
Qian Cao | Yahui Liu | Wei Bi | Yi Zhao | Ruihua Song | Xiting Wang | Ruiming Tang | Guorui Zhou | Han Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
Unlocking Exploration in RLVR: Uncertainty-aware Advantage Shaping for Deeper Reasoning
Can Xie | Ruotong Pan | Xiangyu Wu | Zhang Yunfei | Jiayi Fu | Tingting Gao | Guorui Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement Learning with Verifiable Rewards (RLVR) has shown significant promise for enhancing the reasoning capabilities of large language models (LLMs). However, prevailing algorithms like GRPO broadcast a uniform advantage signal across all tokens in a sequence. This coarse-grained approach overlooks the pivotal role of uncertain, high-stakes decisions during reasoning, leading to inefficient exploration and the well-documented problem of entropy collapse. To address this, we introduce UnCertainty-aware Advantage Shaping (UCAS), a model-free method that refines credit assignment by leveraging the model’s internal uncertainty signals. UCAS operates in two stages: it first modulates the response-level advantage using the model’s overall self-confidence, and then applies a token-level penalty based on raw logit certainty. This dual mechanism encourages exploration of high-uncertainty paths that yield correct answers while penalizing overconfident yet erroneous reasoning, effectively balancing the exploration-exploitation trade-off. Extensive experiments on five mathematical reasoning benchmarks show that UCAS significantly outperforms strong RLVR baselines across multiple model scales, including 1.5B and 7B. Our analysis confirms that UCAS not only achieves higher rewards but also promotes greater reasoning diversity and successfully mitigates entropy collapse.
DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing
Hongzhi Zhang | Yuanze Hu | Tinghai Zhang | Jia Fu | Tao Wang | Junwei Jing | Zhaoxin Fan | Wei Bi | Ruiming Tang | Han Li | Guorui Zhou | Kun Gai
Findings of the Association for Computational Linguistics: ACL 2026
The evolution of Large Language Models (LLMs) towards autonomous agents has catalyzed progress in Deep Research. While retrieval capabilities are well-benchmarked, the post-retrieval synthesis stage, where agents must digest massive amounts of context and consolidate fragmented evidence into coherent, long-form reports, remains under-evaluated due to the subjectivity of open-ended writing. To bridge this gap, we introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We leverage high-quality survey papers as gold standards, reverse-engineer research requests, and construct Oracle Contexts from their bibliographies to isolate synthesis from retrieval noise. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization), transforming subjective judgment into verifiable metrics. Experiments across 96 tasks reveal that synthesizing information from hundreds of references remains a significant challenge. Our results demonstrate that agentic "plan-then-write" workflows significantly outperform single-turn generation, effectively reducing hallucinations and improving adherence to complex structural constraints.
OneRec-Think: In-Text Reasoning for Generative Recommendation
Zhanyu Liu | Shiyao Wang | Xingmei Wang | Rongzhou Zhang | Jiaxin Deng | Honghui Bao | Jinghao Zhang | Wuchao Li | PengFei Zheng | Xiangyu Wu | Yifei Hu | Qigen Hu | Xinchen Luo | Lejian Ren | Zhang Zixing | Qianqian Wang | Kuo Cai | Yunfan Wu | Hongtao Cheng | Zexuan Cheng | Lu Ren | Huanjie Wang | Yi Su | Ruiming Tang | Kun Gai | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The powerful generative capacity of Large Language Models (LLMs) has instigated a paradigm shift in recommendation. However, existing generative models (e.g., OneRec) operate as implicit predictors, critically lacking the capacity for explicit and controllable reasoning—a key advantage of LLMs. To bridge this gap, we propose OneRec-Think, a unified framework that seamlessly integrates dialogue, reasoning, and personalized recommendation. OneRec-Think incorporates: (1) Itemic Alignment: cross-modal Item-Textual Alignment for semantic grounding; (2) Reasoning Activation: Reasoning Scaffolding to activate LLM reasoning within the recommendation context; and (3) Reasoning Enhancement, where we design a recommendation-specific reward function that accounts for the multi-validity nature of user preferences. Experiments across public benchmarks show state-of-the-art performance. Moreover, our proposed "Think-Ahead" architecture enables effective industrial deployment, achieving a 0.159% gain in APP Stay Time and validating the practical efficacy of the model’s explicit reasoning capability.
Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
Zhenpeng Su | Leiyu Pan | Minxuan Lv | Tiehua Mei | Zijia Lin | Yuntao Li | Wenping Hu | Ruiming Tang | Kun Gai | Guorui Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of un-sampled actions. We integrate ERC into both the DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.
Compressing then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
Da Li | Yuxiao Luo | Keping Bi | Jiafeng Guo | Wei Yuan | Biao Yang | Yan Wang | Fan Yang | Tingting Gao | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform an MLLM into a competitive embedding model. CoMa achieves new state-of-the-art results among MLLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness. Our project is available at https://github.com/Trustworthy-Information-Access/CoMa.
Co-authors
- Kun Gai 5
- Ruiming Tang 5
- Tingting Gao 3
- Han Li 3
- Kuo Cai 2
- Wenping Hu 2
- Yuntao Li 2
- Xinchen Luo 2
- Minxuan Lv 2
- Leiyu Pan 2
- Zhenpeng Su 2
- Victoria W. 2
- Shiyao Wang 2
- Xiangyu Wu 2
- Fan Yang 2
- Honghui Bao 1
- Keping Bi 1
- Qian Cao 1
- Bo Chen 1
- Hongtao Cheng 1
- Zexuan Cheng 1
- Jiaxin Deng 1
- Zhaoxin Fan 1
- Jiayi Fu 1
- Jia Fu 1
- Jiafeng Guo (嘉丰 郭) 1
- Xiao Hu 1
- Yuanze Hu 1
- Yifei Hu 1
- Qigen Hu 1
- Junwei Jing 1
- Wuchao Li 1
- Da Li 1
- Zijia Lin 1
- Yahui Liu (刘亚慧) 1
- Zhanyu Liu 1
- Xingyu Lu 1
- Qiang Luo 1
- Yuxiao Luo 1
- Liyuan Mao 1
- Tiehua Mei 1
- Ruotong Pan 1
- Lejian Ren 1
- Lu Ren 1
- Junda She 1
- Ruihua Song 1
- Yi Su 1
- Xiting Wang 1
- Tao Wang 1
- Xingmei Wang 1
- Qianqian Wang 1
- Huanjie Wang 1
- Yan Wang 1
- Bin Wen 1
- Yunfan Wu 1
- Can Xie 1
- Biao Yang 1
- Wei Yuan 1
- Zhang Yunfei 1
- Yifan Zhang 1
- Tianke Zhang 1
- Zhiyang Zhang 1
- Fuzheng Zhang 1
- Hongzhi Zhang 1
- Tinghai Zhang 1
- Rongzhou Zhang 1
- Jinghao Zhang 1
- Yi Zhao 1
- PengFei Zheng 1
- Zhang Zixing 1