Visual Goal-Step Inference using wikiHow
Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, Chris Callison-Burch
Abstract
Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100M, increasing the VGSI accuracy by 15–20%. Our task will facilitate multimodal reasoning about procedural events.
- Anthology ID: 2021.emnlp-main.165
- Original: 2021.emnlp-main.165v1
- Version 2: 2021.emnlp-main.165v2
- Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
- Month: November
- Year: 2021
- Address: Online and Punta Cana, Dominican Republic
- Venue: EMNLP
- Publisher: Association for Computational Linguistics
- Pages: 2167–2179
- URL: https://aclanthology.org/2021.emnlp-main.165
- DOI: 10.18653/v1/2021.emnlp-main.165
- Cite (ACL): Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar, and Chris Callison-Burch. 2021. Visual Goal-Step Inference using wikiHow. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2167–2179, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): Visual Goal-Step Inference using wikiHow (Yang et al., EMNLP 2021)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2021.emnlp-main.165.pdf
- Code: yueyang1996/wikihow-vgsi
- Data: wikiHow-image, COIN, HowTo100M
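
The four-way choice described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes the goal text and the four candidate images have already been embedded into a shared vector space by some hypothetical encoder, and it uses cosine similarity as an assumed scoring function. The names `cosine`, `choose_step`, and `accuracy` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def choose_step(goal_vec: Vector, image_vecs: Sequence[Vector]) -> int:
    """Return the index of the candidate image most similar to the goal.

    Assumption: exactly four candidates, as in the task description.
    """
    if len(image_vecs) != 4:
        raise ValueError("VGSI examples have exactly four candidate images")
    scores = [cosine(goal_vec, img) for img in image_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: List[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Fraction of examples where the chosen image matches the gold index."""
    correct = sum(choose_step(goal, imgs) == gold for goal, imgs, gold in examples)
    return correct / len(examples)


# Toy, hand-made embeddings (hypothetical values, not real data).
goal = [1.0, 0.0, 0.5]
candidates = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.4],   # closest in direction to the goal vector
    [-1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
]
predicted = choose_step(goal, candidates)
```

Under this framing, random guessing would score 25% accuracy, which is the natural baseline for any model evaluated on the task.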