Daoxin Zhang


2026

Long-form outline generation requires satisfying multiple competing objectives simultaneously: outlines must be engaging, well-organized, topically relevant, and comprehensive while maintaining logical consistency across hierarchical structures. Current approaches either rely on expensive multi-turn interactions with large language models or employ procedural refinement pipelines that cannot systematically learn from critique. We present Logic-RL, a framework that transforms critique-guided outline refinement into a learnable policy through reinforcement learning. Our approach constructs refinement trajectories from teacher demonstrations, synthesizes explicit reasoning chains that decompose the critique-revision process, and optimizes a refinement policy using Group Relative Policy Optimization (GRPO) with structure-aware rewards. Experiments on FreshWiki and WikiOutline show that Logic-RL improves substantially over strong baselines: in average rubric score, the 0.6B model obtains a 79.17% relative gain and the 1.7B model an 8.67% improvement over the best existing methods. Further analysis shows that the learned refinement policies generalize across domains and can be applied iteratively, with quality continuing to improve through three refinement rounds before returns diminish.
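As a reading aid, the group-relative step in group relative policy optimization can be sketched as follows. This is a minimal illustration of the standard advantage normalization, not the authors' implementation: the function name `group_relative_advantages`, the `eps` constant, and the reward values are all hypothetical, and the structure-aware reward itself is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled refinement relative to its own sampling group.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidate refinements of one outline.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. When every candidate in a group receives the same reward, all advantages are zero and the group contributes no learning signal.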