Guorui Zhou
2026
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
Qian Cao | Yahui Liu | Wei Bi | Yi Zhao | Ruihua Song | Xiting Wang | Ruiming Tang | Guorui Zhou | Han Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Zhenpeng Su | Leiyu Pan | Minxuan Lv | Yuntao Li | Wenping Hu | Fuzheng Zhang | Kun Gai | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
Compressing then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding
Da Li | Yuxiao Luo | Keping Bi | Jiafeng Guo | Wei Yuan | Biao Yang | Yan Wang | Fan Yang | Tingting Gao | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that MLLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input enables the embedding model to achieve superior performance on downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform an MLLM into a competitive embedding model. CoMa achieves new state-of-the-art results among MLLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness. Our project is available at https://github.com/Trustworthy-Information-Access/CoMa.
OneRec-Think: In-Text Reasoning for Generative Recommendation
Zhanyu Liu | Shiyao Wang | Xingmei Wang | Rongzhou Zhang | Jiaxin Deng | Honghui Bao | Jinghao Zhang | Wuchao Li | PengFei Zheng | Xiangyu Wu | Yifei Hu | Qigen Hu | Xinchen Luo | Lejian Ren | Zhang Zixing | Qianqian Wang | Kuo Cai | Yunfan Wu | Hongtao Cheng | Zexuan Cheng | Lu Ren | Huanjie Wang | Yi Su | Ruiming Tang | Kun Gai | Guorui Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The powerful generative capacity of Large Language Models (LLMs) has instigated a paradigm shift in recommendation. However, existing generative models (e.g., OneRec) operate as implicit predictors, critically lacking the capacity for explicit and controllable reasoning—a key advantage of LLMs. To bridge this gap, we propose OneRec-Think, a unified framework that seamlessly integrates dialogue, reasoning, and personalized recommendation. OneRec-Think incorporates: (1) Itemic Alignment: cross-modal Item-Textual Alignment for semantic grounding; (2) Reasoning Activation: Reasoning Scaffolding to activate LLM reasoning within the recommendation context; and (3) Reasoning Enhancement, where we design a recommendation-specific reward function that accounts for the multi-validity nature of user preferences. Experiments across public benchmarks show state-of-the-art performance. Moreover, our proposed "Think-Ahead" architecture enables effective industrial deployment, achieving a 0.159% gain in APP Stay Time and validating the practical efficacy of the model’s explicit reasoning capability.
Co-authors
- Kun Gai 2
- Ruiming Tang 2
- Honghui Bao 1
- Keping Bi 1
- Kuo Cai 1
- Qian Cao 1
- Hongtao Cheng 1
- Zexuan Cheng 1
- Jiaxin Deng 1
- Tingting Gao 1
- Jiafeng Guo (嘉丰 郭) 1
- Qigen Hu 1
- Wenping Hu 1
- Yifei Hu 1
- Da Li 1
- Han Li 1
- Wuchao Li 1
- Yuntao Li 1
- Yahui Liu (刘亚慧) 1
- Zhanyu Liu 1
- Xinchen Luo 1
- Yuxiao Luo 1
- Minxuan Lv 1
- Leiyu Pan 1
- Lejian Ren 1
- Lu Ren 1
- Ruihua Song 1
- Yi Su 1
- Zhenpeng Su 1
- Victoria W. 1
- Huanjie Wang 1
- Qianqian Wang 1
- Shiyao Wang 1
- Xingmei Wang 1
- Xiting Wang 1
- Yan Wang 1
- Xiangyu Wu 1
- Yunfan Wu 1
- Biao Yang 1
- Fan Yang 1
- Wei Yuan 1
- Fuzheng Zhang 1
- Jinghao Zhang 1
- Rongzhou Zhang 1
- Yi Zhao 1
- PengFei Zheng 1
- Zhang Zixing 1
Venues
- ACL 4