Itamar Bar-Yossef
2025
Transparent and Coherent Procedural Mistake Detection
Shane Storks | Itamar Bar-Yossef | Yayuan Li | Zheyuan Zhang | Jason J Corso | Joyce Chai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.
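As a rough illustration of the kind of NLI-based coherence checking the abstract describes (not the paper's actual metric formulation, which is not reproduced here), the sketch below scores how strongly a generated rationale entails a step-completion decision using an off-the-shelf MNLI checkpoint. The model choice ("roberta-large-mnli"), the example rationale/decision pair, and the use of raw entailment probability as the score are all assumptions for illustration only.

# Minimal sketch (assumptions noted above): entailment-based coherence scoring
# between a generated rationale and a procedural decision, via an NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed checkpoint; any MNLI-style NLI model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise (rationale) entails the hypothesis (decision)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # Look up the entailment class index from the model config instead of hard-coding it.
    entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]
    return probs[entail_idx].item()

# Hypothetical rationale/decision pair for one procedural step.
rationale = "The user has already cracked the eggs into the bowl and begun whisking."
decision = "The step 'crack the eggs into the bowl' was completed successfully."
print(f"rationale -> decision entailment: {entailment_prob(rationale, decision):.3f}")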
2023
Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?
Yuwei Bao | Keunwoo Yu | Yichi Zhang | Shane Storks | Itamar Bar-Yossef | Alex de la Iglesia | Megan Su | Xiao Zheng | Joyce Chai
Findings of the Association for Computational Linguistics: EMNLP 2023
Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need a sophisticated understanding of the user as well as the environment, and must make timely, accurate decisions on when and what to say. To address this issue, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.