Xiuchao Sui
2025
From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems
Xiuchao Sui | Daiying Tian | Qi Sun | Ruirui Chen | Dongkyu Choi | Kenneth Kwok | Soujanya Poria
Findings of the Association for Computational Linguistics: EMNLP 2025
Foundation models (FMs) are increasingly applied to bridge language and action in embodied agents, yet the operational characteristics of different integration strategies remain under-explored, especially for complex instruction following and versatile action generation in changing environments. We investigate three paradigms for robotic systems: end-to-end vision-language-action models (VLAs) that implicitly unify perception and planning, and modular pipelines using either vision-language models (VLMs) or multimodal large language models (MLLMs). Two case studies frame the comparison: instruction grounding, which probes fine-grained language understanding and cross-modal disambiguation; and object manipulation, which targets skill transfer via VLA finetuning. Our experiments reveal trade-offs in system scale, generalization and data efficiency. These findings indicate design lessons for language-driven physical agents and point to challenges and opportunities for FM-powered robotics in real-world conditions.