Grounded Semantic Role Labelling from Synthetic Multimodal Data for Situated Robot Commands

Claudiu Daniel Hromei; Antonio Scaiella; Danilo Croce; Roberto Basili

Abstract

Understanding natural language commands in situated Human-Robot Interaction (HRI) requires linking linguistic input to perceptual context. Traditional symbolic parsers lack the flexibility to operate in complex, dynamic environments. We introduce a novel Multimodal Grounded Semantic Role Labelling (G-SRL) framework that combines frame semantics with perceptual grounding, enabling robots to interpret commands via multimodal logical forms. Our approach leverages modern Visual Language Models (VLLMs), which jointly process text and images, and is supported by an automated pipeline that generates high-quality training data. Structured command annotations are converted into photorealistic scenes via LLM-guided prompt engineering and diffusion models, then rigorously validated through object detection and visual question answering. The pipeline produces over 11,000 image-command pairs (3,500+ manually validated), while approaching the quality of manually curated datasets at significantly lower cost.

Anthology ID: 2025.emnlp-main.1212
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 23758–23781
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1212/
Cite (ACL): Claudiu Daniel Hromei, Antonio Scaiella, Danilo Croce, and Roberto Basili. 2025. Grounded Semantic Role Labelling from Synthetic Multimodal Data for Situated Robot Commands. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23758–23781, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Grounded Semantic Role Labelling from Synthetic Multimodal Data for Situated Robot Commands (Hromei et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1212.pdf
Checklist: 2025.emnlp-main.1212.checklist.pdf