CPC-GRPO: Answer-Free Reinforcement Learning with Cross-Prompt Consensus Rewards

Gyunyeop Kim, Sangwoo Kang


Abstract
Reinforcement learning with verifiable rewards has improved reasoning in language models, but it typically relies on a ground-truth answer or an external verifier, which limits applicability and increases cost. We propose an answer-free training objective that derives rewards solely from the model’s own probabilities by exploiting prompt paraphrases as multiple semantic views of the same intent. For each paraphrase set, we generate candidate responses, rescore each response under the other paraphrased prompts via teacher forcing, and define a cross-prompt consensus reward that serves as a practical internal training signal, favoring responses supported across views rather than those that fit only a single phrasing. We optimize this reward using a policy update with an all-pairs objective and advantage broadcasting across prompt–response pairs. The framework naturally supports prefix-level training, enabling a controllable cost–signal trade-off. Experiments on RobustAlpacaEval and out-of-domain reasoning benchmarks (OpenBookQA, AQuA, HumanEval) show strong in-domain gains and competitive or improved average out-of-domain performance over pre-trained and answer-free training baselines on LLaMA3.2-3B and Qwen3-4B, alongside analyses demonstrating reward–performance alignment and the importance of design choices such as excluding self-view scores and ensembling-based candidates. All experiment code is available at our GitHub.
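The reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the score definition (a length-normalized log-probability), the mean over the other views, the standard GRPO-style group normalization, and the reading of "advantage broadcasting" as reusing one response's advantage for every prompt–response pair are all assumptions. The function names and the toy numbers are hypothetical.

```python
from statistics import mean, pstdev


def consensus_rewards(scores):
    """scores[i][j]: assumed length-normalized log-prob of the response
    generated from paraphrase i, rescored under paraphrase j via teacher
    forcing. The reward for response i averages over j != i, so the
    self-view score scores[i][i] is excluded."""
    rewards = []
    for i, row in enumerate(scores):
        other_views = [s for j, s in enumerate(row) if j != i]
        rewards.append(mean(other_views))
    return rewards


def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within the paraphrase group (assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_advantages(advantages, num_prompts):
    """One possible reading of advantage broadcasting: response i keeps the
    same advantage under every prompt j, giving an all-pairs training set
    keyed by (prompt index, response index)."""
    return {
        (j, i): adv
        for i, adv in enumerate(advantages)
        for j in range(num_prompts)
    }


# Toy scores for three paraphrases, one response each (made-up numbers).
scores = [
    [-0.2, -1.0, -1.2],
    [-0.9, -0.3, -0.5],
    [-2.0, -2.2, -0.1],  # fits its own phrasing well, the others poorly
]
rewards = consensus_rewards(scores)
advantages = group_advantages(rewards)
pairs = broadcast_advantages(advantages, num_prompts=len(scores))
```

In this toy case the third response has the best self-view score but the lowest consensus reward, which is the behavior the abstract attributes to excluding self-view scores.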
Anthology ID:
2026.findings-acl.1486
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
29733–29748
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1486/
Cite (ACL):
Gyunyeop Kim and Sangwoo Kang. 2026. CPC-GRPO: Answer-Free Reinforcement Learning with Cross-Prompt Consensus Rewards. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29733–29748, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
CPC-GRPO: Answer-Free Reinforcement Learning with Cross-Prompt Consensus Rewards (Kim & Kang, Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1486.pdf
Checklist:
 2026.findings-acl.1486.checklist.pdf