Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining for Task-Oriented Dialog
Chia-Chien Hung, Anne Lauscher, Ivan Vulić, Simone Ponzetto, Goran Glavaš
Abstract
Research on (multi-domain) task-oriented dialog (TOD) has predominantly focused on the English language, primarily due to the shortage of robust TOD datasets in other languages, preventing the systematic investigation of cross-lingual transfer for this crucial NLP application area. In this work, we introduce Multi2WOZ, a new multilingual multi-domain TOD dataset, derived from the well-established English dataset MultiWOZ, that spans four typologically diverse languages: Chinese, German, Arabic, and Russian. In contrast to concurrent efforts, Multi2WOZ contains gold-standard dialogs in target languages that are directly comparable with development and test portions of the English dataset, enabling reliable and comparative estimates of cross-lingual transfer performance for TOD. We then introduce a new framework for multilingual conversational specialization of pretrained language models (PrLMs) that aims to facilitate cross-lingual transfer for arbitrary downstream TOD tasks. Using such conversational PrLMs specialized for concrete target languages, we systematically benchmark a number of zero-shot and few-shot cross-lingual transfer approaches on two standard TOD tasks: Dialog State Tracking and Response Retrieval. Our experiments show that, in most setups, the best performance entails the combination of (i) conversational specialization in the target language and (ii) few-shot transfer for the concrete TOD task. Most importantly, we show that our conversational specialization in the target language allows for an exceptionally sample-efficient few-shot transfer for downstream TOD tasks.
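The conversational specialization described above can broadly be thought of as intermediate training of a multilingual PrLM on target-language dialog data before fine-tuning on a TOD task. The sketch below illustrates that general recipe with a plain masked-language-modeling objective in Hugging Face Transformers; it is not the authors' training pipeline. The model name, the `DialogTurns` wrapper, and all hyperparameters are illustrative assumptions; the actual specialization objectives and corpora are those described in the paper and the umanlp/multi2woz repository.

```python
# Minimal, hypothetical sketch of conversational specialization via
# intermediate MLM training on target-language dialog turns.
# NOT the authors' pipeline: objectives, data and hyperparameters are placeholders.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # any multilingual PrLM


class DialogTurns(Dataset):
    """Wraps a list of target-language dialog turns (plain strings)."""

    def __init__(self, turns, tokenizer, max_len=128):
        self.enc = tokenizer(turns, truncation=True, max_length=max_len)

    def __len__(self):
        return len(self.enc["input_ids"])

    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}


def specialize(turns, output_dir="mlm-specialized"):
    """Continue MLM pretraining on dialog turns, then save the specialized model."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir=output_dir,
                             per_device_train_batch_size=16,
                             num_train_epochs=1,
                             logging_steps=50)
    Trainer(model=model, args=args,
            train_dataset=DialogTurns(turns, tokenizer),
            data_collator=collator).train()
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


if __name__ == "__main__":
    # Toy German dialog turns; in practice one would use a large
    # target-language conversational corpus (e.g., OpenSubtitles).
    sample = ["Ich suche ein günstiges Restaurant im Zentrum.",
              "Gerne, für wie viele Personen soll ich reservieren?"]
    specialize(sample)
```

The specialized checkpoint would then replace the vanilla multilingual PrLM when fine-tuning on a downstream TOD task such as Dialog State Tracking or Response Retrieval, in either a zero-shot or a few-shot transfer setup.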
- Anthology ID: 2022.naacl-main.270
- Volume: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month: July
- Year: 2022
- Address: Seattle, United States
- Venue: NAACL
- Publisher: Association for Computational Linguistics
- Pages: 3687–3703
- URL: https://aclanthology.org/2022.naacl-main.270
- DOI: 10.18653/v1/2022.naacl-main.270
- Cite (ACL): Chia-Chien Hung, Anne Lauscher, Ivan Vulić, Simone Ponzetto, and Goran Glavaš. 2022. Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining for Task-Oriented Dialog. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3687–3703, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal): Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining for Task-Oriented Dialog (Hung et al., NAACL 2022)
- PDF: https://preview.aclanthology.org/ingestion-script-update/2022.naacl-main.270.pdf
- Code: umanlp/multi2woz
- Data: CCNet, MultiWOZ, OpenSubtitles