Predicting Implicit Arguments in Procedural Video Instructions

Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller


Abstract
Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure like verb,what,where/with. Procedural instructions are highly elliptic, for instance, (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step’s where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models’ contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and a 14.7% for where/with-implicit semantic roles over GPT-4o.
Anthology ID:
2025.acl-long.1467
Volume:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
30399–30419
Language:
URL:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1467/
DOI:
Bibkey:
Cite (ACL):
Anil Batra, Laura Sevilla-Lara, Marcus Rohrbach, and Frank Keller. 2025. Predicting Implicit Arguments in Procedural Video Instructions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30399–30419, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Predicting Implicit Arguments in Procedural Video Instructions (Batra et al., ACL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-acl-25/2025.acl-long.1467.pdf