Sangwoo Kang


2026

Diffusion large language models (dLLMs) generate text by repeatedly unmasking a partially noised sequence in parallel, promising lower latency than autoregressive decoding. However, most discrete dLLMs still rely on fixed denoising schedules, which are non-adaptive to input difficulty and cannot learn efficient unmasking orders. This paper introduces a reinforcement learning (RL) framework that transforms dLLM decoding into a trajectory-aware, learnable policy. We propose a confidence-gated denoising strategy that dynamically decides which tokens to unmask and how many to unmask per step, enabling adaptive exploration of denoising trajectories. Building on Group Relative Policy Optimization, we reformulate it into a trajectory-aware variant, TA-GRPO-d, which combines a trajectory-level signal—captured as the z-score of the AUC over intermediate rewards—with a token-level unmasking-time weight. This design allows the model to learn not only the final output quality but also the efficiency of the decoding path itself. Experiments on MATH-500, Countdown, Sudoku, and code benchmarks (HumanEval, MBPP) show that TA-GRPO-d maintains or improves accuracy while reducing average denoising steps by up to half, achieving both faster inference and lower computational cost. Our approach provides an RL framework for optimizing dLLM decoding policies toward adaptive, efficient reasoning. Code is available at our GitHub.
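As a rough illustration of the trajectory-level signal mentioned in this abstract, the sketch below computes an area under the intermediate-reward curve for each rollout and z-scores it within a group. The function names, the trapezoidal rule, the normalization by step count, and the use of the population standard deviation are assumptions made for illustration; the abstract does not specify them, and the token-level unmasking-time weight is not modeled here.

```python
from statistics import mean, pstdev


def trajectory_auc(rewards):
    """Trapezoidal area under the intermediate-reward curve.

    `rewards` holds one reward per denoising step (hypothetical input format).
    The area is divided by the number of intervals so that trajectories with
    different step counts stay comparable (an assumption of this sketch).
    """
    if not rewards:
        return 0.0
    if len(rewards) == 1:
        return float(rewards[0])
    area = sum((a + b) / 2.0 for a, b in zip(rewards, rewards[1:]))
    return area / (len(rewards) - 1)


def auc_zscores(group_rewards, eps=1e-8):
    """Z-score each rollout's AUC against the other rollouts in its group."""
    aucs = [trajectory_auc(r) for r in group_rewards]
    mu = mean(aucs)
    sigma = pstdev(aucs)
    return [(a - mu) / (sigma + eps) for a in aucs]


if __name__ == "__main__":
    # Both rollouts end with reward 1.0, but the first reaches it earlier,
    # so it receives the larger trajectory-level score.
    group = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
    print(auc_zscores(group))
```

Under these assumptions, two rollouts with the same final reward are still separated by how early the reward appears along the denoising path, which is the kind of efficiency signal the abstract describes.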
Reinforcement learning with verifiable rewards (RLVR) typically evaluates only final outcomes, providing limited learning signal about whether the generated reasoning is consistent with the correct answer. As a result, even when ground-truth answers are available during training, on-policy rollouts can repeatedly produce reasoning that is inconsistent with the answer. We propose Answer-Guided Group Relative Policy Optimization (AG-GRPO) for masked diffusion language models (dLLMs), which generate text through iterative masked-token restoration. AG-GRPO combines standard answer-free (AF) rollouts, sampled without access to the ground-truth answer, with answer-guided (AG) rollouts. In AG rollouts, the model generates reasoning conditioned on an anchored ground-truth answer suffix, and then re-predicts the answer from the generated reasoning for reward computation. We compute group-relative advantages over the combined AF/AG rollout set, allowing answer-guided training signals to improve the answer-free policy used at test time. Across mathematics, puzzle-solving, and code-generation benchmarks, AG-GRPO consistently improves over the pretrained dLLM and a prior RL method for masked dLLMs. We further analyze optimization dynamics to study how shared group-relative advantages support signal transfer and affect convergence. Our code is available at https://github.com/JuHyng/ag_grpo.
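The idea of computing group-relative advantages over the combined AF/AG rollout set can be sketched as follows. This is a minimal illustration, not the released implementation: the function name, the mean and standard deviation normalization, and the scalar reward format are assumptions.

```python
from statistics import mean, pstdev


def combined_group_advantages(af_rewards, ag_rewards, eps=1e-8):
    """Normalize rewards over the union of answer-free and answer-guided rollouts.

    Returns (af_advantages, ag_advantages). Because the mean and standard
    deviation are shared, AG rollouts shift the baseline that AF rollouts
    are compared against (assumed normalization, for illustration only).
    """
    rewards = list(af_rewards) + list(ag_rewards)
    if not rewards:
        return [], []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    n_af = len(af_rewards)
    return advantages[:n_af], advantages[n_af:]


if __name__ == "__main__":
    # Every AF rollout fails here. Normalized on their own they would all get
    # zero advantage; pooled with successful AG rollouts they become negative.
    af_adv, ag_adv = combined_group_advantages([0.0, 0.0], [1.0, 1.0])
    print(af_adv, ag_adv)
```

In this toy case, a prompt on which all answer-free rollouts fail still yields a non-zero learning signal once the answer-guided rollouts are pooled into the same group.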
Reinforcement learning with verifiable rewards has improved reasoning in language models, but it typically relies on a ground-truth answer or an external verifier, which limits applicability and increases cost. We propose an answer-free training objective that derives rewards solely from the model’s own probabilities by exploiting prompt paraphrases as multiple semantic views of the same intent. For each paraphrase set, we generate candidate responses, rescore each response under the other paraphrased prompts via teacher forcing, and define a cross-prompt consensus reward that serves as a practical internal training signal, favoring responses supported across views rather than those that fit only a single phrasing. We optimize this reward using a policy update with an all-pairs objective and advantage broadcasting across prompt–response pairs. The framework naturally supports prefix-level training, enabling a controllable cost–signal trade-off. Experiments on RobustAlpacaEval and out-of-domain reasoning benchmarks (OpenBookQA, AQuA, HumanEval) show strong in-domain gains and competitive or improved average out-of-domain performance over pre-trained and answer-free training baselines on LLaMA3.2-3B and Qwen3-4B, alongside analyses demonstrating reward–performance alignment and the importance of design choices such as excluding self-view scores and ensembling-based candidates. All experiment code is available at our GitHub.
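A minimal sketch of a cross-prompt consensus reward, under stated assumptions: `scores[r][p]` is taken to be a precomputed teacher-forced score (for example, a mean token log-probability) of response `r` under paraphrase `p`, and `source[r]` is the index of the paraphrase that produced response `r`. Both names and the plain averaging are hypothetical; the abstract only states that responses are rescored under the other paraphrases and that self-view scores are excluded.

```python
def consensus_rewards(scores, source):
    """Average each response's score over all paraphrases except its own.

    scores: list of rows, one per response; row[p] is the score under prompt p.
    source: source[r] is the index of the prompt that generated response r.
    """
    rewards = []
    for r, row in enumerate(scores):
        # Exclude the self-view: the prompt the response was sampled from.
        others = [s for p, s in enumerate(row) if p != source[r]]
        if not others:
            raise ValueError("need at least two paraphrases per response")
        rewards.append(sum(others) / len(others))
    return rewards


if __name__ == "__main__":
    scores = [
        [-1.0, -2.0, -4.0],  # response 0, generated from paraphrase 0
        [-5.0, -0.5, -1.5],  # response 1, generated from paraphrase 1
    ]
    print(consensus_rewards(scores, source=[0, 1]))
```

Dropping the self-view keeps a response from being rewarded merely for fitting the phrasing it was sampled from.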

2025

Autoregressive decoding in large language models (LLMs) requires a full forward pass for each generated token, which significantly increases inference latency. To address this limitation, we propose Fractal-LLM, a lossless self-speculative decoding method that embeds a compressed model within selected decoder layers of the original model. The injected compressed layers generate multiple draft tokens in parallel, and these draft tokens are then verified in a single forward pass of the original model, ensuring that the final outputs exactly match those produced by the original model. Experimental results across diverse benchmarks, including GSM8K, XSUM, CNN/DailyMail, and HumanEval, demonstrate that our method achieves substantial inference speed-ups (up to 2.47×) compared to standard autoregressive decoding, without requiring any additional training.
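The verification step that makes speculative decoding lossless can be illustrated with a generic greedy acceptance rule. This is a sketch of the standard technique under greedy decoding, not Fractal-LLM's specific procedure: `verify_draft` and its inputs are hypothetical names, and `target_preds[i]` is assumed to be the original model's greedy token given the context plus `draft[:i]`, all obtained from one forward pass.

```python
def verify_draft(draft, target_preds):
    """Accept the longest draft prefix that matches the original model.

    draft: list of drafted token ids.
    target_preds: list of len(draft) + 1 greedy token ids from the original
        model, where target_preds[i] follows the context plus draft[:i].
    Returns the tokens to append: the matching prefix plus one token from the
    original model, so at least one token is produced per verification pass.
    """
    if len(target_preds) != len(draft) + 1:
        raise ValueError("target_preds must have len(draft) + 1 entries")
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_preds[i]:
            break  # first mismatch: discard this and all later draft tokens
        accepted.append(tok)
    # Append the original model's own token at the first unverified position.
    accepted.append(target_preds[len(accepted)])
    return accepted


if __name__ == "__main__":
    print(verify_draft([5, 7, 9], [5, 7, 8, 3]))
```

Because every emitted token is either confirmed by or taken directly from the original model, the output equals what plain greedy decoding with the original model would have produced.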
The performance of MoE-based LLMs depends on the router’s ability to select suitable experts; however, the router is typically not explicitly supervised to acquire this routing ability. We propose Exploration-Driven Reinforcement Learning (ERL), which explicitly optimizes the router by exploring alternative routing paths. For every input, ERL evaluates (i) the original routing path and (ii) paths in which an 𝛼-fraction of routing decisions is randomly perturbed, and treats their performance gap as an advantage signal for reinforcement learning. Moreover, MoE-ERLwPL mitigates the risk of performance collapse from expert over-specialization induced by routing reinforcement learning by intentionally enforcing overlap in experts’ knowledge. Without adding parameters or external reward models, our method improves summarization (SAMSum, XSUM), question answering (SQuAD), and language modeling (WikiText-2), and raises routing quality, delivering up to 8.9× higher MRR than baselines over 100 perturbed routing paths. Code is available at our GitHub.
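The perturbation step in this abstract can be sketched as below. The sketch assumes top-1 routing represented as one expert index per routing decision, rounds the perturbed count to the nearest integer, and defines the advantage as the perturbed score minus the original score; none of these details is given in the abstract, and the function names are hypothetical.

```python
import random


def perturb_routing(routes, num_experts, alpha, rng):
    """Return a copy of `routes` with an alpha-fraction of decisions changed.

    routes: list of expert indices, one per routing decision (top-1 assumed).
    Each selected position is reassigned to a different, randomly chosen expert.
    """
    if num_experts < 2:
        raise ValueError("need at least two experts to perturb a decision")
    k = round(alpha * len(routes))
    positions = rng.sample(range(len(routes)), k)
    perturbed = list(routes)
    for pos in positions:
        choices = [e for e in range(num_experts) if e != routes[pos]]
        perturbed[pos] = rng.choice(choices)
    return perturbed


def routing_advantage(score_original, score_perturbed):
    """Performance gap used as the advantage (sign convention assumed)."""
    return score_perturbed - score_original


if __name__ == "__main__":
    rng = random.Random(0)
    original = [0, 1, 2, 3]
    explored = perturb_routing(original, num_experts=4, alpha=0.5, rng=rng)
    print(original, explored)
```

With this sign convention, a perturbed path that scores better than the original receives a positive advantage, so the router would be pushed toward the explored alternative.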

2015