Kaifeng Liu
2026
FAER: Benchmarking VLMs for Failure-Aware Embodied Reasoning
Hao Song | Kaifeng Liu | Yuanxing Liu | Xiang Tian | Xuesong Wang | Chen Yifan | Weinan Zhang | Ting Liu
Findings of the Association for Computational Linguistics: ACL 2026
Failures are inevitable when embodied agents execute complex tasks. Vision-language models (VLMs) serve as the core component of embodied agents in perceiving the environment and making decisions, so assessing their ability to detect and reason about failures has become increasingly important. Previous work primarily considered low-level manipulation failures (e.g., 3cm grasp offsets), neglecting high-level failures that arise during long-horizon task execution by embodied agents (e.g., an object-dropping failure in the “clean room” task). In this paper, we propose FAER, a failure-aware benchmark that evaluates the performance of VLMs on failure detection, failure categorization, failure description, and failure correction in long-horizon tasks. FAER comprises 3,323 episodes, spanning 3 scenes, 65 tasks, and 83 objects. We assess the performance of 16 widely used VLMs and 4 LLMs on FAER tasks. Experimental results show that nearly all VLMs, even GPT-4o, exhibit limited performance in failure detection, with a high false negative rate: they tend to ignore abnormal events. This reveals notable gaps in current models’ capacity to handle failures effectively.
MARS2: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
Pengfei Li | Shijie Wang | Fangyuan Li | Yikun Fu | Kaifeng Liu | Kaiyan Zhang | Dazhi Zhang | Yuqiang Li | Biqing Qi | Bowen Zhou
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement learning (RL) paradigms have demonstrated strong performance on reasoning-intensive tasks such as code generation. However, limited trajectory diversity often leads to diminishing returns, which constrains the achievable performance ceiling. Search-enhanced RL alleviates this issue by introducing structured exploration, but it remains constrained by single-agent policy priors. Meanwhile, leveraging multiple interacting policies can yield more diverse exploratory signals, but existing approaches are typically decoupled from structured search. We propose MARS2 (Multi-Agent Reinforced Tree-Search Scaling), a unified RL framework in which multiple independently optimized agents collaborate within a shared tree-structured search environment. MARS2 models the search tree as a learnable multi-agent interaction environment, enabling heterogeneous agents to collaboratively generate and refine candidate solutions within a shared search topology. To support learning, we introduce a path-level group advantage formulation based on tree-consistent reward shaping, which facilitates effective credit assignment across complex search trajectories. Experiments on code generation benchmarks show that MARS2 consistently improves performance across diverse model combinations and training settings, demonstrating the effectiveness of coupling multi-agent collaboration with tree search for enhancing reinforcement learning. Our code is publicly available at https://github.com/TsinghuaC3I/MARTI.