Mengyuan Sun

2026

Traditional reinforcement learning from human feedback (RLHF) optimizes policies on fixed training inputs, limiting the diversity of learning signals. We propose JODP (Joint Optimization of Data and Policy), a framework where the evolving policy model generates improved variants of training problems to enhance its own learning. While training problems remain fixed, JODP optimizes how they are presented: the policy generates specification hints that guide rollout generation, then learns to reproduce the discovered high-reward behaviors without the hints. This "if you can solve it with a hint, learn to solve it without one" principle creates a co-evolutionary dynamic where better policies discover better specifications, which enable further policy improvement. JODP operates as a plug-and-play enhancement to existing algorithms: specifications are selected via UCB bandits for exploration-exploitation balance, used only during training rollouts, and discarded at deployment. Through evaluation on safety alignment tasks, we demonstrate consistent improvements with GRPO, RLOO, and REINFORCE++, allowing 4B models to approach 8B model performance using less than 1% additional computational overhead.
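The abstract says specifications are selected via UCB bandits but gives no details of the selection rule. As a rough illustration only, the sketch below uses the standard UCB1 rule over a fixed pool of candidate hints. The class name, the exploration constant, and the per-hint rewards are assumptions made for this example and are not taken from the paper.

```python
import math


class UCB1Selector:
    """UCB1 bandit over a fixed pool of candidate specification hints.

    Hypothetical helper: the paper's actual selection rule may differ.
    """

    def __init__(self, num_arms: int, exploration: float = 2.0):
        self.counts = [0] * num_arms    # times each hint was used
        self.means = [0.0] * num_arms   # running mean reward per hint
        self.exploration = exploration  # assumed constant (UCB1 uses 2)

    def select(self) -> int:
        # Try every hint once before applying the confidence bound.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental update of the running mean reward.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def run_bandit(rewards: list[float], rounds: int) -> list[int]:
    """Simulate selection with fixed, made-up rewards per hint."""
    selector = UCB1Selector(len(rewards))
    for _ in range(rounds):
        arm = selector.select()
        selector.update(arm, rewards[arm])
    return selector.counts
```

In a training loop, the reward passed to `update` would be the rollout reward obtained with the chosen hint. Calling `run_bandit([0.1, 0.9, 0.5], 200)` shows the selector concentrating its pulls on the highest-reward hint while still sampling the others occasionally.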

2025


Large language models (LLMs) have demonstrated remarkable capabilities, especially the recent advancements in reasoning, such as o1 and o3, pushing the boundaries of AI. Despite these impressive achievements in mathematics and coding, the reasoning abilities of LLMs in domains requiring cryptographic expertise remain underexplored. In this paper, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems, covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive and real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods, spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning abilities not only between general-purpose chat LLMs and reasoning-focused LLMs but also in the performance of current reasoning-focused models when applied to classical cryptographic decryption tasks, highlighting the challenges these models face in understanding and manipulating encrypted data. Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.
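To make the decryption task concrete, here is a minimal sketch of how one benchmark-style item could be built and scored. The abstract does not name its nine algorithms, so the Caesar shift is an assumed stand-in for a classical cipher, and the function names and exact-match scoring are hypothetical.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; leave other characters unchanged."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)


def encrypt(plaintext: str, key: int) -> str:
    return caesar_shift(plaintext, key)


def decrypt(ciphertext: str, key: int) -> str:
    return caesar_shift(ciphertext, -key)


def exact_match(prediction: str, reference: str) -> bool:
    # Assumed scoring rule: the model's output must equal the plaintext exactly.
    return prediction == reference


ciphertext = encrypt("Hello, World!", 3)   # "Khoor, Zruog!"
recovered = decrypt(ciphertext, 3)         # "Hello, World!"
```

A model under evaluation would receive `ciphertext` (plus whatever hints the benchmark supplies) and its output would be compared with the original plaintext, here via `exact_match`.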

Co-authors

Wei Ye 1

Venues

Findings 2
