AG-GRPO: Answer-Guided GRPO for Masked Diffusion Language Models

Juhyeong Kim, Gyunyeop Kim, Sangwoo Kang


Abstract
Reinforcement learning with verifiable rewards (RLVR) typically evaluates only final outcomes, providing limited learning signal about whether the generated reasoning is consistent with the correct answer. As a result, even when ground-truth answers are available during training, on-policy rollouts can repeatedly produce reasoning that is inconsistent with the answer. We propose Answer-Guided Group Relative Policy Optimization (AG-GRPO) for masked diffusion language models (dLLMs), which generate text through iterative masked-token restoration. AG-GRPO combines standard answer-free (AF) rollouts, sampled without access to the ground-truth answer, with answer-guided (AG) rollouts. In AG rollouts, the model generates reasoning conditioned on an anchored ground-truth answer suffix, and then re-predicts the answer from the generated reasoning for reward computation. We compute group-relative advantages over the combined AF/AG rollout set, allowing answer-guided training signals to improve the answer-free policy used at test time. Across mathematics, puzzle-solving, and code-generation benchmarks, AG-GRPO consistently improves over the pretrained dLLM and a prior RL method for masked dLLMs. We further analyze optimization dynamics to study how shared group-relative advantages support signal transfer and affect convergence. Our code is available at https://github.com/JuHyng/ag_grpo.
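As a rough illustration of the shared-advantage idea in the abstract, the sketch below normalizes rewards over one combined group of AF and AG rollouts for a single prompt. It assumes the usual GRPO-style normalization (subtract the group mean, divide by the group standard deviation); the abstract does not state the exact formula, and the function name, the `eps` constant, and the example rewards are hypothetical, not taken from the authors' code.

```python
from statistics import mean, pstdev


def group_relative_advantages(af_rewards, ag_rewards, eps=1e-6):
    """Normalize rewards over the combined AF/AG rollout group.

    Assumption: mean/std normalization as in standard GRPO; the paper's
    exact form may differ. `eps` (hypothetical) avoids division by zero
    when every rollout in the group receives the same reward.
    """
    rewards = list(af_rewards) + list(ag_rewards)
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    n_af = len(af_rewards)
    # Split back so each rollout type can be handled separately downstream.
    return advantages[:n_af], advantages[n_af:]


# Made-up rewards for one prompt: AF rollouts mostly fail, AG rollouts mostly succeed.
af_adv, ag_adv = group_relative_advantages([0.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```

Because both rollout types share one baseline, a successful AF rollout is scored against a group that also contains AG successes, which is one way to read the abstract's claim that answer-guided signals transfer to the answer-free policy.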
Anthology ID:
2026.acl-long.1724
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
37175–37191
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1724/
Cite (ACL):
Juhyeong Kim, Gyunyeop Kim, and Sangwoo Kang. 2026. AG-GRPO: Answer-Guided GRPO for Masked Diffusion Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 37175–37191, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
AG-GRPO: Answer-Guided GRPO for Masked Diffusion Language Models (Kim et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1724.pdf
Checklist:
 2026.acl-long.1724.checklist.pdf