Provably Safe Offline-to-Online RL: Decoupling Learning from Data-Driven Safety Enforcement

Kaitong Cai, Jusheng Zhang, Keze Wang


Abstract
Hybrid offline–online reinforcement learning (O2O RL) promises both sample efficiency and robust exploration, but suffers from instability caused by distribution shift between offline and online data. We introduce RLPD-GX, a framework that decouples policy optimization from safety enforcement: a reward-seeking learner explores freely, while a projection-based guardian guarantees rule-consistent execution and safe value backups. This design preserves the exploratory value of online interactions without collapsing to conservative policies. To further stabilize training, we propose dynamic curricula that gradually extend temporal horizons and anneal offline–online data mixing. We prove convergence via a contraction property of the guarded Bellman operator, and empirically show state-of-the-art performance on Atari-100k, achieving a normalized mean score of 3.02 (+45% over prior hybrid methods) with stronger safety and stability. Beyond Atari, ablations demonstrate consistent gains across safety-critical and long-horizon tasks, underscoring the generality of our design. Together, these results highlight decoupled safety enforcement as a simple yet principled route to robust O2O RL, suggesting a broader paradigm for reconciling exploration and safety in reinforcement learning.
Anthology ID:
2026.acl-long.528
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
11517–11536
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.528/
Cite (ACL):
Kaitong Cai, Jusheng Zhang, and Keze Wang. 2026. Provably Safe Offline-to-Online RL: Decoupling Learning from Data-Driven Safety Enforcement. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11517–11536, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Provably Safe Offline-to-Online RL: Decoupling Learning from Data-Driven Safety Enforcement (Cai et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.528.pdf
Checklist:
2026.acl-long.528.checklist.pdf