Minho Park
2022
Learning to Embed Multi-Modal Contexts for Situated Conversational Agents
Haeju Lee | Oh Joon Kwon | Yunseon Choi | Minho Park | Ran Han | Yoonhyung Kim | Jinhyeon Kim | Youngjune Lee | Haebin Shin | Kangwook Lee | Kee-Eung Kim
Findings of the Association for Computational Linguistics: NAACL 2022
The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e., visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. Using a single unified model, this approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems.
Co-authors
- Haeju Lee 1
- Oh Joon Kwon 1
- Yunseon Choi 1
- Ran Han 1
- Yoonhyung Kim 1