STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization

Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, Wei Liu


Abstract
Multi-turn interaction remains challenging for online reinforcement learning. Current GRPO-based methods, whether at the trajectory level or the step level, still face three fundamental problems in multi-turn settings: they allocate sampling uniformly across tasks regardless of difficulty, propagate misleading learning signals that penalize correct intermediate actions in failed trajectories, and incur high sample-collection costs in long-horizon environments. Step-level variants (e.g., GIGPO) ease some interaction-cost constraints by decomposing trajectories, yet they retain GRPO’s sampling imbalance and still struggle with heterogeneous multi-turn tasks. To address these issues, we propose STEP (Success-rate-aware Trajectory-Efficient Policy Optimization), a framework that dynamically allocates sampling based on per-task success rates and performs fine-grained step-level optimization. STEP maintains a smoothed success-rate record to guide adaptive trajectory resampling, allocating more sampling effort to harder tasks. It then computes success-rate-weighted advantages and decomposes trajectories into step-level samples, followed by a step-level GRPO augmentation that strengthens updates on low-success tasks. Experiments on OSWorld and AndroidWorld show that STEP substantially improves sample efficiency and training stability over both trajectory-level and existing step-level GRPO variants, converging faster and generalizing better under the same sampling budget.
Anthology ID:
2026.findings-acl.1532
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
30681–30692
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1532/
Cite (ACL):
Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. 2026. STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 30681–30692, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization (Chen et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1532.pdf
Checklist:
 2026.findings-acl.1532.checklist.pdf