Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task
Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, Chitta Baral
Abstract
‘Actions’ play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform ‘Reasoning about Actions & Change’ (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP (Sampat et al., 2021) is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine this encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
- Anthology ID:
- 2022.findings-emnlp.436
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2022
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5914–5924
- URL:
- https://aclanthology.org/2022.findings-emnlp.436
- Cite (ACL):
- Shailaja Keyur Sampat, Pratyay Banerjee, Yezhou Yang, and Chitta Baral. 2022. Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5914–5924, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Cite (Informal):
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task (Sampat et al., Findings 2022)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2022.findings-emnlp.436.pdf