Aochong Oliver Li

2026

Better LLM Reasoning via Dual-Play
Zhengxin Zhang | Chengyu Huang | Aochong Oliver Li | Claire Cardie
Findings of the Association for Computational Linguistics: ACL 2026

Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to learn from themselves, thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite this promise, dual-play training has seen only limited adaptation to LLMs. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions’ quality and diversity. To avoid reward hacking, the Proposer is rewarded only for producing valid questions that push the Solver’s limits, while the Solver is rewarded for solving them correctly, and both are updated jointly. Experimental results show that PasoDoble can improve the math reasoning performance of LLMs.

2025


Memorization vs. Reasoning: Updating LLMs with New Knowledge
Aochong Oliver Li | Tanya Goyal
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) encode vast amounts of pre-trained knowledge in their parameters, but updating them as real-world information evolves remains a challenge. Existing methodologies and benchmarks primarily target entity substitutions, failing to capture the full breadth of complex real-world dynamics. In this paper, we introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates reflected in an evidence corpus. KUP’s evaluation framework includes direct and indirect probes to test both memorization of updated facts and reasoning over them, for any update-learning method. Next, we present a lightweight method called memory-conditioned training (MCT), which conditions tokens in the update corpus on self-generated “memory” tokens during training. Our strategy encourages LLMs to surface and reason over newly memorized knowledge at inference. Our results on two LLM families show that (1) the KUP benchmark is highly challenging, with the best continued pre-training (CPT) models achieving <2% in the indirect probing setting (reasoning), and (2) MCT significantly outperforms prior CPT baselines, improving direct probing (memorization) results by up to 25.4%.
