Qipeng Huang
2026
Escaping the Echo Trap: On Credit Assignment Failure in Multi-turn LLM Self-Reflection
Linxuan Du | Guangquan Xue | Xiaobo Liang | Qipeng Huang | Yuyang Ding | Xinyu Shi | Zhang Yijun | Ji Qi | Wenpeng Zhu | Juntao Li | Min Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the potential of multi-turn self-reflection to improve LLM reasoning, its effectiveness in practice is severely constrained by a failure mode we term the Echo Trap. Specifically, this phenomenon gives rise to two coupled problems: (1) the model becomes limited by its inherent capabilities and tends to repeat earlier reflections to preserve reward signals; (2) once such “copy” behavior is reinforced, the model ceases to try new strategies, leading to exploration collapse. We attribute this issue to imprecise credit assignment during training, as standard GRPO assigns rewards at the trajectory level, making it difficult to distinguish which reflection steps contribute to improved outcomes. To address this limitation, we propose a tree-structured extension of GRPO for multi-turn self-reflection, which enables more accurate advantage estimation. Through extensive experiments, we analyze the Echo Trap and demonstrate that our method effectively mitigates behavior collapse and improves performance across multiple benchmarks.
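The credit-assignment contrast described in this abstract can be illustrated with a small sketch. The first function is the usual group-normalized advantage used by GRPO, applied once per trajectory so that every reflection turn inherits the same value. The second is only an illustrative guess at what a tree-structured variant might look like: it assumes rollouts branch at each reflection turn and normalizes each branch against its siblings. The dictionary layout and the names `node_value` and `sibling_advantages` are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def trajectory_level(trajectories):
    """Trajectory-level credit: every turn in a rollout shares one advantage."""
    advs = group_advantages([t["reward"] for t in trajectories])
    return [[a] * t["turns"] for t, a in zip(trajectories, advs)]


def node_value(node):
    """Assumed value of a tree node: mean final reward of the leaves below it."""
    children = node.get("children", [])
    if not children:
        return node["reward"]
    return mean(node_value(c) for c in children)


def sibling_advantages(node, out=None, path=()):
    """Hypothetical tree-level credit: normalize each branch against its siblings.

    Returns a dict mapping a branch path (tuple of child indices) to the
    advantage of the reflection step that created that branch.
    """
    out = {} if out is None else out
    children = node.get("children", [])
    if children:
        advs = group_advantages([node_value(c) for c in children])
        for i, (child, adv) in enumerate(zip(children, advs)):
            out[path + (i,)] = adv
            sibling_advantages(child, out, path + (i,))
    return out


# Toy tree: the first reflection branch leads to one success, the second to none.
tree = {"children": [
    {"children": [{"reward": 1.0}, {"reward": 0.0}]},
    {"children": [{"reward": 0.0}, {"reward": 0.0}]},
]}
step_advantages = sibling_advantages(tree)
flat_advantages = trajectory_level([
    {"reward": 1.0, "turns": 2},
    {"reward": 0.0, "turns": 2},
])
```

In the toy tree, the two second-turn reflections under the failing first-turn branch both receive zero advantage, while the trajectory-level variant cannot separate a useful turn from a copied one within the same rollout.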
DUAL RM: Beyond Rule-based Preference Reward Modeling via Meta-Reward
Xiaobo Liang | Wanfu Wang | Qipeng Huang | Yuyang Ding | Zecheng Tang | Yixin Ji | Qianben Chen | Zhe Zhao | Kehai Chen | Juntao Li | Min Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The ability to model sparse and underspecified rewards, characteristic of human preferences, is fundamental to scaling Reinforcement Learning (RL). Current preference-based reward modeling largely relies on verifiable rewards, where human-annotated labels define rule-based signals. However, these methods face a fundamental bottleneck we term the Matryoshka Doll Problem: a recursive dependency where each reward verifier requires a meta-verifier, leading to continuous and costly dependence on human annotation. In this work, we propose Dual RM, which couples discriminative and generative reward models (DisRMs and GenRMs) under a non-parametric meta-reward. Rather than verifying the correctness of GenRM’s reasoning, the meta-reward evaluates its practical impact on response quality. Specifically, GenRM identifies multi-dimensional evaluation rubrics and iteratively refines the response, while DisRM quantifies the quality shifts induced by each rubric. Furthermore, we implement rubric-based test-time scaling to improve sample efficiency and preference alignment under both DPO and GRPO. Our experiments demonstrate that Dual RM achieves strong performance across major preference benchmarks. Notably, even when trained exclusively on the language modality, it exhibits robust cross-modal transfer on Omni-RewardBench.
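One way to read the meta-reward described in this abstract is as a score difference: a rubric is rewarded by how much the discriminative model's score changes after the response is refined under that rubric. The sketch below is a toy illustration of that reading only. The function `rubric_meta_rewards`, the callables `refine` and `score`, and the choice to carry each refinement forward to the next rubric are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, List


def rubric_meta_rewards(
    prompt: str,
    response: str,
    rubrics: List[str],
    refine: Callable[[str, str, str], str],  # stand-in for the GenRM refinement step
    score: Callable[[str, str], float],      # stand-in for the DisRM scalar score
) -> List[float]:
    """Return one meta-reward per rubric: the score shift its refinement induces."""
    rewards = []
    current = response
    base = score(prompt, current)
    for rubric in rubrics:
        refined = refine(prompt, current, rubric)
        new = score(prompt, refined)
        rewards.append(new - base)    # quality shift attributed to this rubric
        current, base = refined, new  # assumption: refinements accumulate
    return rewards


# Toy stand-ins so the sketch runs without any model.
def toy_refine(prompt: str, response: str, rubric: str) -> str:
    return response + " " + rubric


def toy_score(prompt: str, response: str) -> float:
    return float(len(response.split()))


shifts = rubric_meta_rewards(
    "question", "hi", ["clarity", "accuracy"], toy_refine, toy_score
)
```

Under this reading no separate verifier of the rubric's reasoning is needed: a rubric that leaves the score unchanged simply earns a meta-reward of zero.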