Enyu Zhou
2026
MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
Jiahang Lin | Kai Hu | Binghai Wang | Yuhao Zhou | Zhiheng Xi | Honglin Guo | Shichun Liu | Junzhe Wang | Shihan Dou | Enyu Zhou | Hang Yan | Zhenhua Han | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2026
Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce **MM-Doc-R1**, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose **Similarity-based Policy Optimization (SPO)**, which addresses baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms such as GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO, which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to training performance that surpasses GRPO. Our experiments on the MMLongBench-Doc benchmark show that **MM-Doc-R1** outperforms previous baselines by **10.4%**. Furthermore, **SPO** demonstrates superior performance over **GRPO**, boosting results by **5.0%** with Qwen3-8B and **6.1%** with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state of the art in complex, long-document visual question answering.
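The contrast between a shared group baseline and a similarity-weighted baseline can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how similarity is measured or how the weights are normalized, so the code assumes a precomputed, non-negative similarity matrix and works at the trajectory level only. The function names `group_mean_baseline`, `similarity_weighted_baseline`, and `advantages` are hypothetical.

```python
from typing import List, Sequence


def group_mean_baseline(rewards: Sequence[float]) -> List[float]:
    """GRPO-style baseline: every trajectory shares the plain group mean."""
    mean = sum(rewards) / len(rewards)
    return [mean for _ in rewards]


def similarity_weighted_baseline(
    rewards: Sequence[float],
    similarity: Sequence[Sequence[float]],
) -> List[float]:
    """Baseline for trajectory i: average of all rewards weighted by similarity[i][j].

    Assumption: `similarity` is a precomputed non-negative matrix; the measure
    used in the paper is not given in the abstract.
    """
    baselines = []
    for i in range(len(rewards)):
        weights = similarity[i]
        total = sum(weights)
        if total <= 0:
            raise ValueError("each trajectory needs at least one positive weight")
        weighted_sum = sum(w * r for w, r in zip(weights, rewards))
        baselines.append(weighted_sum / total)
    return baselines


def advantages(rewards: Sequence[float], baselines: Sequence[float]) -> List[float]:
    """Advantage = reward minus baseline, per trajectory."""
    return [r - b for r, b in zip(rewards, baselines)]
```

With a uniform similarity matrix the weighted baseline reduces to the group mean, so under these assumptions the two estimators differ only when some trajectories are more alike than others.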
VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training
Dingwei Zhu | Shihan Dou | Zhiheng Xi | Senjie Jin | Guoqiang Zhang | Jiazheng Zhang | Junjie Ye | Mingxu Chai | Enyu Zhou | Ming Zhang | Yuhui Wang | Caishuang Huang | Chenhao Huang | Yunke Zhang | Yuran Wang | Tao Gui | Qi Zhang | Xipeng Qiu | Xuanjing Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
2024
StepCoder: Improving Code Generation with Reinforcement Learning from Compiler Feedback
Shihan Dou | Yan Liu | Haoxiang Jia | Enyu Zhou | Limao Xiong | Junjie Shan | Caishuang Huang | Xiao Wang | Xiaoran Fan | Zhiheng Xi | Yuhao Zhou | Tao Ji | Rui Zheng | Qi Zhang | Tao Gui | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The advancement of large language models (LLMs) has significantly propelled the field of code generation. Previous work integrated reinforcement learning (RL) with compiler feedback for exploring the output space of LLMs to enhance code generation quality. However, the lengthy code generated by LLMs in response to complex human requirements makes RL exploration a challenge. Also, since the unit tests may not cover the complicated code, optimizing LLMs by using these unexecuted code snippets is ineffective. To tackle these challenges, we introduce StepCoder, a novel RL framework for code generation, consisting of two main components: CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks, while FGO optimizes the model only on executed code, masking the unexecuted code segments to provide Fine-Grained Optimization. In addition, we construct the APPS+ dataset for RL training, which is manually verified to ensure the correctness of unit tests. Experimental results show that our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks. The code and dataset will be made available upon publication.
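The idea of masking unexecuted code during optimization can be sketched as follows. This is an illustrative simplification, not StepCoder's actual code: it assumes each generated token is tagged with the source line it belongs to and that the set of lines executed by the unit tests is known (for example from a coverage tool). The names `executed_token_mask` and `masked_mean_loss` are hypothetical.

```python
from typing import List, Sequence, Set


def executed_token_mask(token_lines: Sequence[int], executed_lines: Set[int]) -> List[bool]:
    """Mark each token True if its source line was executed by the unit tests."""
    return [line in executed_lines for line in token_lines]


def masked_mean_loss(token_losses: Sequence[float], mask: Sequence[bool]) -> float:
    """Average per-token loss over executed tokens only; masked tokens contribute nothing."""
    if len(token_losses) != len(mask):
        raise ValueError("token_losses and mask must have the same length")
    kept = [loss for loss, keep in zip(token_losses, mask) if keep]
    if not kept:
        # No executed tokens: nothing to optimize for this sample.
        return 0.0
    return sum(kept) / len(kept)


# Example: four tokens on lines 1, 1, 2, 3; the tests executed lines 1 and 3 only.
mask = executed_token_mask([1, 1, 2, 3], {1, 3})
loss = masked_mean_loss([0.5, 1.5, 4.0, 1.0], mask)
```

In this sketch the token on the unexecuted line 2 is excluded, so its large loss value does not influence the update.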
LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
Shihan Dou | Enyu Zhou | Yan Liu | Songyang Gao | Wei Shen | Limao Xiong | Yuhao Zhou | Xiao Wang | Zhiheng Xi | Xiaoran Fan | Shiliang Pu | Jiang Zhu | Rui Zheng | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhance their capabilities in downstream tasks. Substantially increasing instruction data is a direct way to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them by using a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of LoRAs to focus on leveraging world knowledge to solve downstream tasks, to alleviate world knowledge forgetting. Experimental results show that, as the instruction data increases, LoRAMoE can significantly improve the ability to process downstream tasks, while maintaining the world knowledge stored in the LLM. Our code is available at https://github.com/Ablustrund/LoRAMoE.
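A router that mixes several low-rank adapters on top of a frozen layer can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code from the linked repository: it uses a softmax router over a single linear layer, represents each expert as a pair of low-rank matrices, and omits the constraint that steers some experts toward world knowledge. The name `lora_moe_forward` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lora_moe_forward(
    x: Sequence[float],
    base_w: Matrix,                       # frozen backbone weight, d_out x d_in
    experts: List[Tuple[Matrix, Matrix]],  # each expert is (A: r x d_in, B: d_out x r)
    router_w: Matrix,                     # router weight, n_experts x d_in
    scale: float = 1.0,
) -> Vector:
    """Frozen base output plus a router-weighted sum of low-rank expert updates."""
    out = matvec(base_w, x)               # backbone path, never updated
    gates = softmax(matvec(router_w, x))  # one gate per expert
    for gate, (a, b) in zip(gates, experts):
        delta = matvec(b, matvec(a, x))   # low-rank update B(Ax)
        out = [o + scale * gate * d for o, d in zip(out, delta)]
    return out
```

Because the backbone weight is only read and never modified in this sketch, all task-specific change is confined to the adapters and the router, which is the property the abstract relies on to limit forgetting.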
2023
RealBehavior: A Framework for Faithfully Characterizing Foundation Models’ Human-like Behavior Mechanisms
Enyu Zhou | Rui Zheng | Zhiheng Xi | Songyang Gao | Xiaoran Fan | Zichu Fei | Jingting Ye | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2023
Reports of human-like behaviors in foundation models are growing, with psychological theories providing enduring tools to investigate these behaviors. However, current research tends to directly apply these human-oriented tools without verifying the faithfulness of their outcomes. In this paper, we introduce a framework, RealBehavior, which is designed to characterize the human-like behaviors of models faithfully. Beyond simply measuring behaviors, our framework assesses the faithfulness of results based on reproducibility, internal and external consistency, and generalizability. Our findings suggest that a simple application of psychological tools cannot faithfully characterize all human-like behaviors. Moreover, we discuss the impacts of aligning models with human and social values, arguing for the necessity of diversifying alignment objectives to prevent the creation of models with restricted characteristics.
Co-authors
- Tao Gui 5
- Xuan-Jing Huang (黄萱菁) 5
- Zhiheng Xi 5
- Shihan Dou 4
- Xiaoran Fan 3
- Qi Zhang 3
- Rui Zheng 3
- Songyang Gao 2
- Caishuang Huang 2
- Yan Liu 2
- Xiao Wang 2
- Limao Xiong 2
- Qi Zhang 2
- Yuhao Zhou 2
- Mingxu Chai 1
- Zichu Fei 1
- Honglin Guo 1
- Zhenhua Han 1
- Kai Hu 1
- Chenhao Huang 1
- Tao Ji 1
- Haoxiang Jia 1
- Senjie Jin 1
- Jiahang Lin 1
- Shichun Liu 1
- Shiliang Pu 1
- Xipeng Qiu (邱锡鹏) 1
- Junjie Shan 1
- Wei Shen 1
- Binghai Wang 1
- Junzhe Wang 1
- Yuhui Wang 1
- Yuran Wang 1
- Hang Yan 1
- Jingting Ye 1
- Junjie Ye (叶俊杰) 1
- Guoqiang Zhang 1
- Jiazheng Zhang 1
- Ming Zhang 1
- Yunke Zhang 1
- Yuhao Zhou 1
- Dingwei Zhu 1
- Jiang Zhu 1