Better LLM Reasoning via Dual-Play

Zhengxin Zhang; Chengyu Huang; Aochong Oliver Li; Claire Cardie

Abstract

Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to learn from themselves—thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions’ quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver’s limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. Experimental results show that PasoDoble can improve the math reasoning performance of LLMs.
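The reward structure described in the abstract can be illustrated with a minimal sketch. The abstract gives only a qualitative description, so everything concrete here is an assumption made for illustration and not the paper's actual formulation: the function name `dual_play_rewards`, the exact-match correctness check, the validity rule (at least one sampled Solver answer must agree with the Proposer's answer), and the "one minus pass rate" difficulty reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rewards:
    proposer: float      # reward for the question the Proposer generated
    solver: List[float]  # one reward per sampled Solver answer


def dual_play_rewards(solver_answers: List[str], reference_answer: str) -> Rewards:
    """Hypothetical reward rule for one Proposer question (illustration only).

    solver_answers: answers sampled from the Solver for the question.
    reference_answer: the ground-truth answer supplied by the Proposer.
    """
    if not solver_answers:
        raise ValueError("need at least one Solver answer")

    # Solver: rewarded for answering correctly (assumed exact string match).
    solver_rewards = [
        1.0 if answer.strip() == reference_answer.strip() else 0.0
        for answer in solver_answers
    ]
    pass_rate = sum(solver_rewards) / len(solver_rewards)

    # Assumed validity rule: if no Solver sample agrees with the Proposer's
    # answer, the question is treated as invalid and earns no reward. This
    # keeps the Proposer from being paid for unanswerable or mislabeled
    # questions, which is one way to discourage reward hacking.
    is_valid = pass_rate > 0.0

    # Proposer: rewarded more the harder a valid question is for the Solver.
    proposer_reward = (1.0 - pass_rate) if is_valid else 0.0
    return Rewards(proposer=proposer_reward, solver=solver_rewards)


if __name__ == "__main__":
    print(dual_play_rewards(["4", "5", "4", "4"], "4"))
```

Under these assumptions, a question every Solver sample gets right and a question no sample gets right both give the Proposer zero reward, so only valid questions near the Solver's limit pay off.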

Anthology ID: 2026.findings-acl.1752
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 35111–35139
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1752/
Cite (ACL): Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, and Claire Cardie. 2026. Better LLM Reasoning via Dual-Play. In Findings of the Association for Computational Linguistics: ACL 2026, pages 35111–35139, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Better LLM Reasoning via Dual-Play (Zhang et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1752.pdf
Checklist: 2026.findings-acl.1752.checklist.pdf
