Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?

Yuwei Bao, Keunwoo Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alex de la Iglesia, Megan Su, Xiao Zheng, Joyce Chai


Abstract
Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need a sophisticated understanding of the user as well as the environment, and must make timely, accurate decisions on when and what to say. To address this issue, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.
Anthology ID:
2023.findings-emnlp.824
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12325–12341
URL:
https://aclanthology.org/2023.findings-emnlp.824
DOI:
10.18653/v1/2023.findings-emnlp.824
Cite (ACL):
Yuwei Bao, Keunwoo Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alex de la Iglesia, Megan Su, Xiao Zheng, and Joyce Chai. 2023. Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12325–12341, Singapore. Association for Computational Linguistics.
Cite (Informal):
Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake? (Bao et al., Findings 2023)
PDF:
https://preview.aclanthology.org/naacl24-info/2023.findings-emnlp.824.pdf