@inproceedings{zhao-etal-2024-epo,
    title = "{EPO}: Hierarchical {LLM} Agents with Environment Preference Optimization",
    author = "Zhao, Qi  and
      Fu, Haotian  and
      Sun, Chen  and
      Konidaris, George",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.367/",
    doi = "10.18653/v1/2024.emnlp-main.367",
    pages = "6401--6415",
    abstract = "Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps. In this paper, we propose a hierarchical framework that decomposes complex tasks into manageable subgoals, utilizing separate LLMs for subgoal prediction and low-level action generation. To address the challenge of creating training signals for unannotated datasets, we develop a reward model that leverages multimodal environment feedback to automatically generate reward signals. We introduce Environment Preference Optimization (EPO), a novel method that generates preference signals from the environment{'}s feedback and uses them to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments."
}