Speaker-adapted neural-network-based fusion for multimodal reference resolution

Diana Kleingarn, Nima Nabizadeh, Martin Heckmann, Dorothea Kolossa


Abstract
Humans use a variety of approaches to reference objects in the external world, including verbal descriptions, hand and head gestures, eye gaze, or any combination of these. The amount of useful information from each modality, however, may vary depending on the specific person and on several other factors. For this reason, it is important to learn the correct combination of inputs for inferring the best-fitting reference. In this paper, we investigate appropriate speaker-dependent and speaker-independent fusion strategies in a multimodal reference resolution task. We show that, without any change in the modality models and only through an optimized fusion technique, it is possible to reduce the error rate of the system on a reference resolution task by more than 50%.
Anthology ID:
W19-5925
Volume:
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue
Month:
September
Year:
2019
Address:
Stockholm, Sweden
Venue:
SIGDIAL
SIG:
SIGDIAL
Publisher:
Association for Computational Linguistics
Pages:
210–214
URL:
https://aclanthology.org/W19-5925
DOI:
10.18653/v1/W19-5925
Cite (ACL):
Diana Kleingarn, Nima Nabizadeh, Martin Heckmann, and Dorothea Kolossa. 2019. Speaker-adapted neural-network-based fusion for multimodal reference resolution. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 210–214, Stockholm, Sweden. Association for Computational Linguistics.
Cite (Informal):
Speaker-adapted neural-network-based fusion for multimodal reference resolution (Kleingarn et al., SIGDIAL 2019)
PDF:
https://preview.aclanthology.org/nodalida-main-page/W19-5925.pdf