SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation
Bhathiya Hemanthage, Christian Dondrup, Phil Bartie, Oliver Lemon
Abstract
SimpleMTOD is a simple language model which recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks. SimpleMTOD is built on a large-scale transformer-based auto-regressive architecture, which has already proven to be successful in uni-modal task-oriented dialogues, and effectively leverages transfer learning from pretrained GPT-2. In order to capture the semantics of visual scenes, we introduce both local and de-localized tokens for objects within a scene. De-localized tokens represent the type of an object rather than the specific object itself and so possess a consistent meaning across the dataset. SimpleMTOD achieves a state-of-the-art BLEU score (0.327) in the Response Generation sub-task of the SIMMC 2.0 test-std dataset while performing on par in other multimodal sub-tasks: Disambiguation, Coreference Resolution, and Dialog State Tracking. This is despite taking a minimalist approach for extracting visual (and non-visual) information. In addition, the model does not rely on task-specific architectural changes such as classification heads.
- Anthology ID:
- 2023.iwcs-1.31
- Volume:
- Proceedings of the 15th International Conference on Computational Semantics
- Month:
- June
- Year:
- 2023
- Address:
- Nancy, France
- Editors:
- Maxime Amblard, Ellen Breitholtz
- Venue:
- IWCS
- SIG:
- SIGSEM
- Publisher:
- Association for Computational Linguistics
- Pages:
- 293–304
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2023.iwcs-1.31/
- Cite (ACL):
- Bhathiya Hemanthage, Christian Dondrup, Phil Bartie, and Oliver Lemon. 2023. SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation. In Proceedings of the 15th International Conference on Computational Semantics, pages 293–304, Nancy, France. Association for Computational Linguistics.
- Cite (Informal):
- SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation (Hemanthage et al., IWCS 2023)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2023.iwcs-1.31.pdf
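
The abstract's distinction between local and de-localized object tokens can be illustrated with a minimal Python sketch. This is not the authors' implementation: the token formats (`<OBJ_n>`, `<TYPE_n>`, `<SCENE>`), the `SceneObject` class, and every function name below are hypothetical, chosen only to show how a scene might be flattened into symbols that a GPT-2-style model could read as part of its input sequence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene object; both fields are illustrative assumptions."""
    local_index: int  # identifies the object within one particular scene
    type_id: int      # identifies the object's type, shared across all scenes


def local_token(obj: SceneObject) -> str:
    # Local token: only meaningful relative to the scene it appears in.
    return f"<OBJ_{obj.local_index}>"


def delocalized_token(obj: SceneObject) -> str:
    # De-localized token: names the object's type, so the same token
    # carries the same meaning wherever it occurs in the dataset.
    return f"<TYPE_{obj.type_id}>"


def scene_to_tokens(objects: List[SceneObject]) -> List[str]:
    # Emit each object as a (local, de-localized) token pair.
    tokens: List[str] = []
    for obj in objects:
        tokens.append(local_token(obj))
        tokens.append(delocalized_token(obj))
    return tokens


def build_input(context: str, objects: List[SceneObject]) -> str:
    # Prepend the symbolic scene to the dialogue context, giving one flat
    # string that an auto-regressive language model could be trained on.
    scene = " ".join(scene_to_tokens(objects))
    return f"<SCENE> {scene} </SCENE> {context}"


if __name__ == "__main__":
    scene = [SceneObject(local_index=0, type_id=17),
             SceneObject(local_index=1, type_id=42)]
    print(build_input("USER: Do you have this jacket in blue?", scene))
```

In this sketch, two objects of the same type in different scenes share a de-localized token while keeping distinct local tokens, which is the property the abstract attributes to de-localized tokens.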