Exploiting Tree Structure for Credit Assignment in Reinforcement Learning with Large Language Models

Hieu Tran, Zonghai Yao, Hong Yu


Abstract
Reinforcement learning has shown strong promise for strengthening the reasoning ability of large language models (LLMs), but sparse, delayed rewards over long chains make token-level credit assignment a central challenge. Actor–critic methods like PPO provide token-level credit but require training a value network alongside the policy, which introduces complexity and can encourage overfitting. Critic-free alternatives such as GRPO avoid this burden but rely on sequence-level outcomes, distributing a single reward uniformly across tokens and ignoring structural differences between responses. We propose Prefix-to-Tree (P2T), which organizes the sampled responses of a prompt into a prefix tree and computes nonparametric prefix values by aggregating descendant outcomes. Building on this idea, we develop TEMPO (Tree-Estimated Mean Prefix Value for Policy Optimization), a critic-free algorithm that enriches GRPO with branch-aware temporal-difference (TD) corrections. Across Qwen3-1.7B and Qwen3-4B, TEMPO consistently improves both convergence and final performance over PPO and GRPO on in-distribution benchmarks (MATH, MedQA) and out-of-distribution settings (GSM-HARD, AMC23, MedMCQA, MMLU-Medical), achieving higher validation accuracy within comparable wall-clock time.
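The prefix-tree idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `prefix_values` and `token_advantages` are hypothetical, and the way the TD term is combined with the group-normalized GRPO advantage (a plain sum, with no discounting or scaling) is an assumption, since the abstract only states that GRPO is enriched with branch-aware TD corrections. The sketch estimates the value of each prefix as the mean outcome of all sampled responses that share it, then adds the change in that estimate at each token.

```python
from collections import defaultdict
from statistics import mean, pstdev


def prefix_values(responses, rewards):
    """Nonparametric prefix values: mean reward of all responses sharing a prefix.

    responses: list of token lists sampled for one prompt.
    rewards:   one scalar outcome reward per response.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, reward in zip(responses, rewards):
        # Every prefix of this response (including the empty root) is a tree node.
        for t in range(len(tokens) + 1):
            prefix = tuple(tokens[:t])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}


def token_advantages(responses, rewards):
    """Sequence-level normalized advantage plus a per-token TD term (assumed form)."""
    values = prefix_values(responses, rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid dividing by zero when all rewards match
    advantages = []
    for tokens, reward in zip(responses, rewards):
        seq_adv = (reward - mu) / sigma  # GRPO-style group normalization
        per_token = []
        for t in range(len(tokens)):
            # TD term: change in estimated prefix value caused by emitting token t.
            # It is nonzero only where sampled responses branch apart.
            td = values[tuple(tokens[: t + 1])] - values[tuple(tokens[:t])]
            per_token.append(seq_adv + td)
        advantages.append(per_token)
    return advantages


# Toy group of three sampled responses; only the first one is correct.
responses = [["a", "b"], ["a", "c"], ["d", "e"]]
rewards = [1.0, 0.0, 0.0]
values = prefix_values(responses, rewards)
advs = token_advantages(responses, rewards)
```

In this toy group the shared prefix `("a",)` receives value 0.5 because one of its two descendants is correct, so the token `"b"` that leads to the correct answer gets a larger correction than the shared token `"a"`. Tokens on a path that no other response shares after the branch point receive a zero TD term.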
Anthology ID: 2026.findings-acl.524
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 10795–10810
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.524/
Cite (ACL): Hieu Tran, Zonghai Yao, and Hong Yu. 2026. Exploiting Tree Structure for Credit Assignment in Reinforcement Learning with Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 10795–10810, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Exploiting Tree Structure for Credit Assignment in Reinforcement Learning with Large Language Models (Tran et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.524.pdf
Checklist: 2026.findings-acl.524.checklist.pdf