PEAP: Proactive Embodied Action Sequence Planning with Joint Understanding of Vision and Audio Perception

Tianwei Lan, Jiaqi Wu, Zeming Liu, Zhaoxin Fan, Haifeng Wang, Yuhang Guo


Abstract
Embodied Action Sequence Planning concerns the ability of embodied agents to plan actions based on environmental perception. This technology enables diverse intelligent assistance in real-world scenarios such as home and office environments. Existing embodied agents fall short in two respects: they lack proactivity, and they do not jointly understand visual and audio information. To address these limitations, this study investigates whether embodied agents can proactively provide assistance, without explicit human instructions, by planning action sequences from a joint understanding of vision and audio perception. To this end, we propose PEAP, the first multimodal proactive embodied action sequence planning dataset. We evaluate multiple Large Language Models on the PEAP dataset. The results show that these models still exhibit significant deficiencies on this task, in particular a lack of accurate environmental perception. Furthermore, ablation and replacement experiments corroborate that joint understanding of multimodal information significantly improves the models’ performance on the proactive embodied action sequence planning task.
Anthology ID:
2026.acl-long.1060
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
23118–23138
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1060/
Cite (ACL):
Tianwei Lan, Jiaqi Wu, Zeming Liu, Zhaoxin Fan, Haifeng Wang, and Yuhang Guo. 2026. PEAP: Proactive Embodied Action Sequence Planning with Joint Understanding of Vision and Audio Perception. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23118–23138, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PEAP: Proactive Embodied Action Sequence Planning with Joint Understanding of Vision and Audio Perception (Lan et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1060.pdf
Checklist:
 2026.acl-long.1060.checklist.pdf