Abstract
Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment’s feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.
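The abstract describes training LLM agents from preference signals derived from environment feedback rather than human annotation. The sketch below is a minimal illustration of that idea, assuming a DPO-style Bradley–Terry objective over trajectory log-probabilities; the names `epo_preference_loss`, `beta`, and `rank_by_environment_reward` are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the paper's official code): a DPO-style preference loss
# where the chosen/rejected pair is ranked by an environment-derived reward
# model instead of human labels.
import torch
import torch.nn.functional as F


def epo_preference_loss(logp_chosen: torch.Tensor,
                        logp_rejected: torch.Tensor,
                        ref_logp_chosen: torch.Tensor,
                        ref_logp_rejected: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Bradley-Terry preference loss over sequence log-probabilities.

    logp_* are summed token log-probs under the policy model for the
    subgoal/action sequence the environment reward model preferred (chosen)
    versus dispreferred (rejected); ref_logp_* are the same quantities under
    a frozen reference model.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # Maximize the margin between the preferred and dispreferred sequences.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()


# Hypothetical usage: two candidate sequences are scored by an environment
# reward model, and the higher-scoring one becomes the "chosen" example.
# chosen_seq, rejected_seq = rank_by_environment_reward(candidate_sequences)
```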
- Anthology ID: 2024.emnlp-main.367
- Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 6401–6415
- URL: https://aclanthology.org/2024.emnlp-main.367
- DOI: 10.18653/v1/2024.emnlp-main.367
- Cite (ACL): Qi Zhao, Haotian Fu, Chen Sun, and George Konidaris. 2024. EPO: Hierarchical LLM Agents with Environment Preference Optimization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6401–6415, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): EPO: Hierarchical LLM Agents with Environment Preference Optimization (Zhao et al., EMNLP 2024)
- PDF: https://preview.aclanthology.org/dois-2013-emnlp/2024.emnlp-main.367.pdf