What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan


Abstract
Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on mathematical reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets.
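The two labeling ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the exact criteria, so the function name `pseudo_rewards`, the thresholds `tau_pos`, `tau_neg`, and `max_entropy`, the use of the empirical answer distribution's normalized entropy as the uncertainty measure, and the direction of the gate are all assumptions made for illustration only.

```python
import math
from collections import Counter


def pseudo_rewards(answers, tau_pos=0.6, tau_neg=0.15, max_entropy=0.8):
    """Assign a pseudo-reward to each rollout answer for one prompt.

    Hypothetical rules (assumptions, not taken from the paper):
      +1.0  the answer is the majority answer AND its vote share >= tau_pos
      -1.0  the entropy gate is open AND the answer's vote share <= tau_neg
       0.0  otherwise (no supervision signal for that rollout)
    """
    n = len(answers)
    if n == 0:
        return []

    counts = Counter(answers)
    shares = {a: c / n for a, c in counts.items()}
    top_answer, _ = counts.most_common(1)[0]

    # Normalized Shannon entropy of the answer distribution, in [0, 1].
    entropy = -sum(p * math.log(p) for p in shares.values())
    norm_entropy = entropy / math.log(len(counts)) if len(counts) > 1 else 0.0

    # Selective positive labeling: trust the majority only under strong consensus.
    positive = top_answer if shares[top_answer] >= tau_pos else None

    # Assumed gate: emit negative labels only when the distribution is not too dispersed.
    gate_open = norm_entropy <= max_entropy

    rewards = []
    for a in answers:
        if positive is not None and a == positive:
            rewards.append(1.0)
        elif gate_open and shares[a] <= tau_neg:
            rewards.append(-1.0)
        else:
            rewards.append(0.0)
    return rewards
```

Under these assumed thresholds, ten rollouts with seven votes for one answer receive positive rewards for the majority and a negative reward for a singleton answer, whereas four rollouts with four distinct answers receive no supervision at all, which is the weak-consensus case the abstract describes.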
Anthology ID:
2026.acl-long.1337
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
28957–28970
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1337/
Cite (ACL):
Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, and Tieniu Tan. 2026. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 28957–28970, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time (Yan et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1337.pdf
Checklist:
2026.acl-long.1337.checklist.pdf