Linxuan Du
2026
Escaping the Echo Trap: On Credit Assignment Failure in Multi-turn LLM Self-Reflection
Linxuan Du | Guangquan Xue | Xiaobo Liang | Qipeng Huang | Yuyang Ding | Xinyu Shi | Zhang Yijun | Ji Qi | Wenpeng Zhu | Juntao Li | Min Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the potential of multi-turn self-reflection to improve LLM reasoning, its effectiveness in practice is severely constrained by a failure mode we term the Echo Trap. Specifically, this phenomenon gives rise to two coupled problems: (1) the model becomes limited by its inherent capabilities and tends to repeat earlier reflections to preserve reward signals; (2) once such “copy” behavior is reinforced, the model ceases to try new strategies, leading to exploration collapse. We attribute this issue to imprecise credit assignment during training, as standard GRPO assigns rewards at the trajectory level, making it difficult to distinguish which reflection steps contribute to improved outcomes. To address this limitation, we propose a tree-structured extension of GRPO for multi-turn self-reflection, which enables more accurate advantage estimation. Through extensive experiments, we analyze the Echo Trap and demonstrate that our method effectively mitigates behavior collapse and improves performance across multiple benchmarks.
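To make the credit assignment issue concrete, the following is a minimal sketch of group-relative advantage estimation as commonly described for GRPO, in which each trajectory receives one scalar advantage that every reflection turn inherits. The abstract does not specify the tree-structured method, so `sibling_advantages` is only a hypothetical illustration of normalizing among branches that share a prefix, not the authors' algorithm. All function names, the choice of population standard deviation, and the toy rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - group mean) / (group std + eps).

    One scalar per trajectory. Population std is an assumption here;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(advantages, turns_per_trajectory):
    """Trajectory-level credit: every turn inherits its trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, turns_per_trajectory)]


def sibling_advantages(child_values, eps=1e-8):
    """Hypothetical tree-style credit (NOT the paper's method).

    child_values holds one value per child branch of a shared prefix,
    e.g. the mean final reward of rollouts below that child. Normalizing
    among siblings gives each reflection step its own advantage.
    """
    return grpo_advantages(child_values, eps)


# Toy group of four sampled trajectories with binary outcome rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
traj_adv = grpo_advantages(rewards)

# Each trajectory has three reflection turns; all turns get identical credit,
# so a turn that merely repeats an earlier reflection is rewarded as much
# as the turn that actually fixed the answer.
per_turn = broadcast_to_turns(traj_adv, [3, 3, 3, 3])

# Two alternative reflections branching from the same prefix (toy values).
branch_adv = sibling_advantages([0.75, 0.25])
```

In the flat case, `per_turn[0]` contains the same value three times, which is the imprecision the abstract attributes to trajectory-level rewards. In the hypothetical sibling case, the two branches receive advantages of opposite sign even though they share the same earlier turns.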