Shilong Liu


2025

CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu | Linyao Chen | Dai-Jie Wu | Yanjun Chen | Zecheng Zhang | Xiang Yao | Zhiqiang Xie | Yongchao Chen | Shilong Liu | Bochen Qian | Anjie Yang | Zhaoxuan Jin | Jianbo Deng | Philip Torr | Bernard Ghanem | Guohao Li
Findings of the Association for Computational Linguistics: ACL 2025

The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce CRAB, the first cross-environment agent benchmark framework, incorporating a graph-based fine-grained evaluation method and an efficient task generation method. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging CRAB, we developed CRAB Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated 6 advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.

2024

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Binxu Li | Tiankai Yan | Yuanting Pan | Jie Luo | Ruiyang Ji | Jiayuan Ding | Zhe Xu | Shilong Liu | Haoyu Dong | Zihao Lin | Yixin Wang
Findings of the Association for Computational Linguistics: EMNLP 2024

Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o. Furthermore, MMedAgent exhibits efficiency in updating and integrating new medical tools.