Multimodal Dialogue State Tracking

Hung Le, Nancy Chen, Steven Hoi


Abstract
Designed to track user goals in dialogues, a dialogue state tracker is an essential component of a dialogue system. However, research on dialogue state tracking has largely been limited to unimodality, in which slots and slot values are constrained by knowledge domains (e.g., the restaurant domain, with slots such as restaurant name and price range) and are defined by a specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task that tracks information about the visual objects mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task that recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Through comprehensive ablation and qualitative analysis, we discovered interesting insights toward building more capable multimodal dialogue systems.
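The abstract's notion of a multimodal dialogue state can be made concrete with a small sketch. The following Python snippet is illustrative only, not taken from the paper or the henryhungle/mm_dst repository: it shows one plausible representation, with a tracked video segment span plus per-object attribute slots that are updated as each utterance arrives. All class, slot, and attribute names here are hypothetical.

    from dataclasses import dataclass, field
    from typing import Dict, Tuple

    @dataclass
    class ObjectState:
        object_id: int
        # Attribute slots for one tracked visual object, e.g. {"color": "red"}
        attributes: Dict[str, str] = field(default_factory=dict)

    @dataclass
    class MultimodalDialogueState:
        # (start_frame, end_frame) of the video segment under discussion
        segment: Tuple[int, int] = (0, 0)
        objects: Dict[int, ObjectState] = field(default_factory=dict)

        def update(self, object_id: int, **slots: str) -> None:
            """Add a newly mentioned object or overwrite its attribute slots."""
            obj = self.objects.setdefault(object_id, ObjectState(object_id))
            obj.attributes.update(slots)

    # After an utterance like "what about the small cube that is sliding?",
    # a tracker might update the state as follows:
    state = MultimodalDialogueState(segment=(0, 120))
    state.update(3, shape="cube", size="small", action="sliding")
    print(state.objects[3].attributes)
    # {'shape': 'cube', 'size': 'small', 'action': 'sliding'}

In the paper itself, VDTN generates and decodes such states from video and dialogue features rather than maintaining a hand-built structure; the sketch only illustrates the kind of information (segment spans, objects, attribute slots) being tracked.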
Anthology ID:
2022.naacl-main.248
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
3394–3415
URL:
https://aclanthology.org/2022.naacl-main.248
DOI:
10.18653/v1/2022.naacl-main.248
Cite (ACL):
Hung Le, Nancy Chen, and Steven Hoi. 2022. Multimodal Dialogue State Tracking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3394–3415, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Multimodal Dialogue State Tracking (Le et al., NAACL 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.naacl-main.248.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-2/2022.naacl-main.248.mp4
Code:
henryhungle/mm_dst
Data:
CATER