Policy-Guided Stepwise Action Planning for Controllable LLM Reasoning

Jianpeng Zhou, Qisheng Hu, Jiahai Wang, Wenya Wang


Abstract
Steering large language model (LLM) reasoning via high-level reasoning actions offers a promising approach to improve robustness and interpretability. However, existing action-based paradigms, ranging from training-free prompting to static plan retrieval or prediction, often fail to consistently outperform standard generation because their planners tend to degenerate into repetitive loops or fixed patterns. We propose PG-HAP (Policy-Guided High-Level Action Planning), a lightweight stepwise planner–executor framework that learns to select reasoning actions dynamically while keeping the executor LLM fully frozen. The planner is trained with reinforcement learning to optimize answer correctness. To prevent degeneration, we introduce two targeted mechanisms: (i) an Action-Dependency Logit Mask that enforces valid transitions to avoid redundancy, and (ii) an Action Diversity Reward that discourages mode collapse by promoting varied action sequences. Across mathematical and commonsense reasoning benchmarks, PG-HAP improves accuracy over strong baselines while producing less redundant, more adaptive trajectories. This demonstrates that learning high-level planning alone can substantially strengthen reasoning without expensive end-to-end model tuning.
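The abstract names the two anti-degeneration mechanisms but does not give their form, so the following is only a minimal sketch of how a transition mask over planner logits and a trajectory-level diversity score could look. The action names, the `ALLOWED_NEXT` table, and the distinct-action-ratio reward are all hypothetical illustrations, not the paper's definitions.

```python
import math

# Hypothetical action vocabulary; the abstract does not list the real action set.
ACTIONS = ["decompose", "calculate", "verify", "conclude"]

# Hypothetical dependency table: ALLOWED_NEXT[prev] holds the actions that may
# follow `prev`. None stands for "no action taken yet".
ALLOWED_NEXT = {
    None: {"decompose", "calculate"},
    "decompose": {"calculate", "verify"},
    "calculate": {"calculate", "verify", "conclude"},
    "verify": {"calculate", "conclude"},
    "conclude": set(),  # terminal action: nothing may follow
}


def masked_action_probs(logits, prev_action):
    """Softmax over planner logits after masking invalid transitions.

    Disallowed actions get a logit of -inf, so their probability is exactly 0.
    """
    allowed = ALLOWED_NEXT[prev_action]
    if not allowed:
        raise ValueError(f"no valid action may follow {prev_action!r}")
    masked = [
        logit if action in allowed else -math.inf
        for action, logit in zip(ACTIONS, logits)
    ]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def diversity_reward(trajectory):
    """Assumed diversity score: ratio of distinct actions to trajectory length.

    A trajectory that repeats one action scores low; one with no repeats scores 1.0.
    """
    if not trajectory:
        return 0.0
    return len(set(trajectory)) / len(trajectory)


# Usage: after "decompose", only "calculate" and "verify" keep nonzero probability.
probs = masked_action_probs([0.0, 0.0, 0.0, 0.0], "decompose")
repetitive = diversity_reward(["calculate"] * 4)
varied = diversity_reward(["decompose", "calculate", "verify", "conclude"])
```

In a setup like this, the diversity score would be added to the answer-correctness reward with some weighting during planner training, while the mask would be applied at every planning step; how PG-HAP actually combines the two is not stated in the abstract.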
Anthology ID:
2026.findings-acl.2024
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
40740–40765
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2024/
Cite (ACL):
Jianpeng Zhou, Qisheng Hu, Jiahai Wang, and Wenya Wang. 2026. Policy-Guided Stepwise Action Planning for Controllable LLM Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 40740–40765, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Policy-Guided Stepwise Action Planning for Controllable LLM Reasoning (Zhou et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.2024.pdf
Checklist:
 2026.findings-acl.2024.checklist.pdf