Xiuqiang He


2026

Contemporary progress in Large Language Models (LLMs) has revealed notable inferential capacities via reinforcement learning (RL) employing verifiable rewards. However, “zero-RL” approaches relying on fixed prompt templates introduce substantial sampling inefficiencies for weak LLMs, as most problems generate invalid outputs during accuracy-driven filtration. To solve this, we propose Cog-Rethinker, a novel hierarchical metacognitive RL framework. Cog-Rethinker enhances the rollout procedure by improving sample utilization through a two-stage framework leveraging human cognition. First, it prompts the policy to decompose zero-accuracy problems into subproblems. Second, it prompts the policy to refine answers by referencing previous wrong solutions. Moreover, to enable cold-starts and maintain train-test consistency, Cog-Rethinker applies supervised fine-tuning using correct samples from these stages. Experimental results demonstrate Cog-Rethinker’s superior performance on mathematical reasoning benchmarks and its improved sample efficiency that accelerates convergence compared to baselines.