Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog

Jiaping Zhang, Tiancheng Zhao, Zhou Yu


Abstract
Creating an intelligent conversational system that understands vision and language is one of the ultimate goals in Artificial Intelligence (AI) (Winograd, 1972). Extensive research has focused on vision-to-language generation; however, limited research has touched on combining these two modalities in a goal-driven dialog context. We propose a multimodal hierarchical reinforcement learning framework that dynamically integrates vision and language for task-oriented visual dialog. The framework jointly learns the multimodal dialog state representation and the hierarchical dialog policy to improve both dialog task success and efficiency. We also propose a new technique, state adaptation, to integrate context awareness in the dialog state representation. We evaluate the proposed framework and the state adaptation technique in an image guessing game and achieve promising results.
Anthology ID:
W18-5015
Volume:
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
140–150
URL:
https://aclanthology.org/W18-5015
DOI:
10.18653/v1/W18-5015
Cite (ACL):
Jiaping Zhang, Tiancheng Zhao, and Zhou Yu. 2018. Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 140–150, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog (Zhang et al., SIGDIAL 2018)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W18-5015.pdf
Data
VQG, VisDial