Feng Zhang
2026
STAPO: Selective Trajectory-Aware Policy Optimization for LLM Agent Training
Qiuyi Qi | Tian Liang | Mutian Bao | Jinjian Zhang | Dongnan Liu | Wei Zhou | Linjian Mo | Ming Kong | Jie Liu | Feng Zhang | Qiang Zhu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) is the dominant paradigm for training Large Language Model (LLM) agents on long-horizon tasks. However, sparse and delayed rewards often lead to trajectory neglect, in which agents lose focus on the task goal and interaction history at intermediate steps. Prior work has explored step-level supervision using Shannon-entropy–based uncertainty signals, which conflate inherent state complexity with agent confidence and therefore provide unreliable estimates of decision reliability. To address this issue, we propose normalized entropy, which measures confidence deviations relative to an agent’s average behavior under a given state, thereby strengthening the association between low-quality actions and trajectory neglect. Building on this insight, we introduce Selective Trajectory-Aware Policy Optimization (STAPO), a hierarchical group-based RL framework. STAPO leverages normalized entropy to locate outlier steps associated with trajectory neglect and optimizes them via a joint mechanism of trajectory-aware reward and trajectory-independent penalty, enhancing trajectory awareness while preserving training stability. Extensive experiments on ALFWorld, WebShop, and Search-Augmented QA demonstrate that STAPO achieves state-of-the-art performance while substantially alleviating trajectory neglect, validating its effectiveness and robustness for agentic tasks.
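The abstract does not give the formula for normalized entropy, so the following is only a rough illustration of the general idea, not the authors' definition. It assumes that several actions are sampled from the same state (as in group-based RL), that each sampled action comes with a probability distribution, and that "deviation from average behavior" can be approximated by a z-score of each sample's Shannon entropy against the group. The function names, the z-score form, and the `threshold` parameter are all hypothetical.

```python
import math
from statistics import mean, pstdev


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a single probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropies(group_probs):
    """Z-score each sample's entropy against its group.

    ASSUMPTION: all distributions in `group_probs` were produced from the
    same state, so the group mean stands in for the agent's "average
    behavior" at that state. This is an illustrative stand-in, not the
    paper's formula.
    """
    entropies = [shannon_entropy(p) for p in group_probs]
    mu = mean(entropies)
    sigma = pstdev(entropies)
    if sigma == 0:
        # Every sample is equally (un)certain: no deviation to report.
        return [0.0 for _ in entropies]
    return [(h - mu) / sigma for h in entropies]


def outlier_steps(group_probs, threshold=1.0):
    """Indices whose entropy is unusually high relative to the group.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    scores = normalized_entropies(group_probs)
    return [i for i, z in enumerate(scores) if z > threshold]


# Three confident samples and one uniform (maximally uncertain) sample.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
group = [confident, confident, confident, uncertain]
flagged = outlier_steps(group)
```

In this toy group only the last sample is flagged. Raw Shannon entropy would also be high for every sample in an inherently ambiguous state; measuring relative to the group is one way to separate that state-level complexity from an individual low-confidence step, which is the distinction the abstract draws.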