Embodied Executable Policy Learning with Language-based Scene Summarization
Jielin Qiu, Mengdi Xu, William Han, Seungwhan Moon, Ding Zhao
Abstract
Large language models (LLMs) have shown remarkable success in assisting robot learning tasks such as complex household planning. However, the performance of pretrained LLMs relies heavily on domain-specific templated text data, which may be infeasible to obtain in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve through non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots’ executable actions in the form of text, derived solely from visual observations. Our proposed paradigm stands apart from previous works, which used either language instructions or a combination of language and visual data as inputs. We demonstrate that our proposed method can employ two fine-tuning strategies, imitation learning and reinforcement learning, to adapt to the target test tasks effectively. We conduct extensive experiments involving various model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.
- Anthology ID:
- 2024.naacl-long.105
- Volume:
- Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Kevin Duh, Helena Gomez, Steven Bethard
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1896–1913
- URL:
- https://aclanthology.org/2024.naacl-long.105
- Cite (ACL):
- Jielin Qiu, Mengdi Xu, William Han, Seungwhan Moon, and Ding Zhao. 2024. Embodied Executable Policy Learning with Language-based Scene Summarization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1896–1913, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Embodied Executable Policy Learning with Language-based Scene Summarization (Qiu et al., NAACL 2024)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2024.naacl-long.105.pdf