Abstract
The recently proposed ALFRED challenge task aims for a virtual robotic agent to complete complex multi-step everyday tasks in a virtual home environment from high-level natural language directives, such as “put a hot piece of bread on a plate”. Currently, the best-performing models are able to complete less than 1% of these tasks successfully. In this work we focus on modeling the translation problem of converting natural language directives into detailed multi-step sequences of actions that accomplish those goals in the virtual environment. We empirically demonstrate that it is possible to generate gold multi-step plans from language directives alone without any visual input in 26% of unseen cases. When a small amount of visual information, the starting location in the virtual environment, is incorporated, our best-performing GPT-2 model successfully generates gold command sequences in 58% of cases, suggesting contextualized language models may provide strong planning modules for grounded virtual agents.
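As a rough illustration of the planning setup the abstract describes, the sketch below shows how a GPT-2 model might be prompted with a high-level directive (plus an optional starting location) and asked to continue with a low-level command sequence. The prompt format, separator strings, generation settings, and example strings are assumptions made for illustration only; the base "gpt2" checkpoint stands in for a model fine-tuned on ALFRED directive–plan pairs, and the authors' actual implementation is in the cognitiveailab/alfred-gpt2 repository linked below.

```python
# Minimal, hypothetical sketch (not the paper's implementation) of a GPT-2
# planning module: prompt with a high-level directive and an optional starting
# location, then generate a low-level command sequence as a continuation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")  # a checkpoint fine-tuned on ALFRED plans would be loaded here

# Assumed serialization: directive [SEP] starting location [PLAN] command sequence.
directive = "put a hot piece of bread on a plate"
start_location = "agent starts near the kitchen counter"
prompt = f"{directive} [SEP] {start_location} [PLAN]"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=80,                    # room for a multi-step plan
    num_beams=5,                          # beam search over candidate plans
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)

# Strip the prompt tokens and keep only the generated plan continuation.
plan = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(plan)
```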
- Anthology ID:
- 2020.findings-emnlp.395
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2020
- Month:
- November
- Year:
- 2020
- Address:
- Online
- Editors:
- Trevor Cohn, Yulan He, Yang Liu
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4412–4417
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2020.findings-emnlp.395/
- DOI:
- 10.18653/v1/2020.findings-emnlp.395
- Cite (ACL):
- Peter Jansen. 2020. Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4412–4417, Online. Association for Computational Linguistics.
- Cite (Informal):
- Visually-Grounded Planning without Vision: Language Models Infer Detailed Plans from High-level Instructions (Jansen, Findings 2020)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2020.findings-emnlp.395.pdf
- Code:
- cognitiveailab/alfred-gpt2
- Data:
- ALFRED