From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, Soujanya Poria


Abstract
Foundation models (FMs) are increasingly applied to bridge language and action in embodied agents, yet the operational characteristics of different integration strategies remain under-explored, especially for complex instruction following and versatile action generation in changing environments. We investigate three paradigms for robotic systems: end-to-end vision-language-action models (VLAs) that implicitly unify perception and planning, and modular pipelines using either vision-language models (VLMs) or multimodal large language models (MLLMs). Two case studies frame the comparison: instruction grounding, which probes fine-grained language understanding and cross-modal disambiguation; and object manipulation, which targets skill transfer via VLA finetuning. Our experiments reveal trade-offs in system scale, generalization, and data efficiency. These findings suggest design lessons for language-driven physical agents and point to challenges and opportunities for FM-powered robotics in real-world conditions.
Anthology ID:
2025.findings-emnlp.69
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1324–1340
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.69/
DOI:
10.18653/v1/2025.findings-emnlp.69
Cite (ACL):
Xiuchao Sui, Daiying Tian, Qi Sun, Ruirui Chen, Dongkyu Choi, Kenneth Kwok, and Soujanya Poria. 2025. From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1324–1340, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems (Sui et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.findings-emnlp.69.pdf
Checklist:
 2025.findings-emnlp.69.checklist.pdf