STAPO: Selective Trajectory-Aware Policy Optimization for LLM Agent Training

Qiuyi Qi; Tian Liang; Mutian Bao; Jinjian Zhang; Dongnan Liu; Wei Zhou; Linjian Mo; Ming Kong; Jie Liu; Feng Zhang; Qiang Zhu

Abstract

Reinforcement Learning (RL) is the dominant paradigm for training Large Language Model (LLM) agents on long-horizon tasks. However, sparse and delayed rewards often lead to trajectory neglect, in which agents lose focus on the task goal and interaction history at intermediate steps. Prior work has explored step-level supervision using Shannon-entropy–based uncertainty signals, which conflate inherent state complexity with agent confidence and therefore provide unreliable estimates of decision reliability. To address this issue, we propose normalized entropy, which measures confidence deviations relative to an agent’s average behavior under a given state, thereby strengthening the association between low-quality actions and trajectory neglect. Building on this insight, we introduce Selective Trajectory-Aware Policy Optimization (STAPO), a hierarchical group-based RL framework. STAPO leverages normalized entropy to locate outlier steps associated with trajectory neglect and optimizes them via a joint mechanism of trajectory-aware reward and trajectory-independent penalty, enhancing trajectory awareness while preserving training stability. Extensive experiments on ALFWorld, WebShop, and Search-Augmented QA demonstrate that STAPO achieves state-of-the-art performance while substantially alleviating trajectory neglect, validating its effectiveness and robustness for agentic tasks.
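The abstract does not give the formula for normalized entropy, so the following is only a minimal sketch of the general idea, not the authors' definition. It assumes that "average behavior under a given state" can be approximated by the mean and standard deviation of Shannon entropy over a group of action distributions sampled for the same state, and that an outlier step is one whose entropy lies well above that group mean. The function names, the z-score form, and the `threshold` parameter are all hypothetical.

```python
import math
from statistics import mean, pstdev


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def normalized_entropies(group_distributions, eps=1e-8):
    """Z-score each step's entropy against its group (assumed normalization).

    `group_distributions` holds one action distribution per sampled step,
    all taken under the same state, so state complexity is shared and only
    the deviation from the group's average confidence remains.
    """
    entropies = [shannon_entropy(p) for p in group_distributions]
    mu = mean(entropies)
    sigma = pstdev(entropies)
    return [(h - mu) / (sigma + eps) for h in entropies]


def outlier_steps(group_distributions, threshold=1.0):
    """Indices of steps whose normalized entropy exceeds `threshold`.

    The direction (unusually high entropy) and the threshold value are
    assumptions made for illustration only.
    """
    scores = normalized_entropies(group_distributions)
    return [i for i, z in enumerate(scores) if z > threshold]


# Hypothetical group: three confident steps and one near-uniform step.
group = [
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [1 / 3, 1 / 3, 1 / 3],
]
flagged = outlier_steps(group)
```

In this toy group, raw Shannon entropy alone would not say whether the fourth step is uncertain because the state is hard or because the agent deviates from its usual behavior; comparing against the group is what isolates the deviation.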

Anthology ID: 2026.acl-long.1308
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 28371–28392
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1308/
Cite (ACL): Qiuyi Qi, Tian Liang, Mutian Bao, Jinjian Zhang, Dongnan Liu, Wei Zhou, Linjian Mo, Ming Kong, Jie Liu, Feng Zhang, and Qiang Zhu. 2026. STAPO: Selective Trajectory-Aware Policy Optimization for LLM Agent Training. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28371–28392, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): STAPO: Selective Trajectory-Aware Policy Optimization for LLM Agent Training (Qi et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1308.pdf
Checklist: 2026.acl-long.1308.checklist.pdf
