Xuesong Wang
2026
FAER: Benchmarking VLMs for Failure-Aware Embodied Reasoning
Hao Song | Kaifeng Liu | Yuanxing Liu | Xiang Tian | Xuesong Wang | Chen Yifan | Weinan Zhang | Ting Liu
Findings of the Association for Computational Linguistics: ACL 2026
Failures are inevitable when embodied agents execute complex tasks. Vision-language models (VLMs) serve as the core component that embodied agents use to perceive the environment and make decisions, so assessing how well VLMs detect and reason about failures has become increasingly important. Previous work primarily considered low-level manipulation failures (e.g., 3 cm grasp offsets) and neglected the high-level failures that arise when embodied agents execute long-horizon tasks (e.g., an object-dropping failure in the “clean room” task). In this paper, we propose FAER, a failure-aware benchmark that evaluates VLMs on failure detection, failure categorization, failure description, and failure correction in long-horizon tasks. FAER comprises 3,323 episodes spanning 3 scenes, 65 tasks, and 83 objects. We assess 16 widely used VLMs and 4 LLMs on the FAER tasks. Experimental results show that nearly all VLMs, even GPT-4o, perform poorly at failure detection and have a high false negative rate: they tend to ignore abnormal events. This reveals notable gaps in current models’ capacity to handle failures effectively.
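To make the false negative rate mentioned in the abstract concrete, here is a minimal Python sketch of how it could be computed for a binary failure-detection task. It is an illustration only, not the benchmark's evaluation code: the function name `false_negative_rate`, the boolean label encoding, and the toy data are all assumptions introduced here.

```python
from typing import Sequence


def false_negative_rate(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Fraction of true failure episodes that the model did not flag.

    labels[i] is True if episode i actually contains a failure (assumed encoding);
    predictions[i] is True if the model reported a failure for episode i.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # Count the episodes that truly contain a failure.
    positives = sum(1 for y in labels if y)
    if positives == 0:
        raise ValueError("no failure episodes; false negative rate is undefined")

    # A false negative is a real failure that the model reported as normal.
    missed = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return missed / positives


# Hypothetical toy data: four failure episodes and two normal ones.
labels = [True, True, True, True, False, False]
predictions = [True, False, False, False, False, True]

fnr = false_negative_rate(labels, predictions)
print(fnr)  # 3 of the 4 real failures were missed, so this prints 0.75
```

A high value here corresponds to the behavior the abstract describes: the model labels episodes that contain a failure as normal.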