On-policy Reinforcement Fine-tuning with Offline reward for Multi-step Embodied Planning

Di Wu; Jiaxin Fan; Chloe Gu; Guanbo Wang; Wei Yin; Wenhao Li; Bo Jin

Abstract

Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and verbal goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle in interactive environments. Reinforcement learning (RL) offers a natural way to address this limitation, yet online RL approaches suffer from costly interaction and sparse rewards in embodied settings. This paper introduces ORBIT, an On-policy Reinforcement fine-tuning (RFT) framework with offline rewards for EmBodIed Task Planning, that preserves the generalization benefits of RFT while addressing the challenges of costly interaction and sparse rewards, supported by solid theoretical guarantees. Our approach is evaluated on EmbodiedBench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that ORBIT achieves SOTA performance on EB-ALFRED, outperforming all closed-source and online-RL-based methods, while being substantially more efficient in training speed and computational cost, remaining robust to sub-optimal expert trajectories, and exhibiting strong generalization to unseen environments. We released all code and data at https://github.com/mail-taii/Reinforced-Reasoning-for-Embodied-Planning.

Anthology ID: 2026.acl-long.1822
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 39277–39307
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1822/
Cite (ACL): Di Wu, Jiaxin Fan, Chloe Gu, Guanbo Wang, Wei Yin, Wenhao Li, and Bo Jin. 2026. On-policy Reinforcement Fine-tuning with Offline reward for Multi-step Embodied Planning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 39277–39307, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): On-policy Reinforcement Fine-tuning with Offline reward for Multi-step Embodied Planning (Wu et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1822.pdf
Checklist: 2026.acl-long.1822.checklist.pdf
