A Repository of Conversational Datasets
Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, Tsung-Hsien Wen
Abstract
Progress in Machine Learning is often driven by the availability of large datasets, and consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation procedure for conversational response selection models using 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering steps to their needs. We introduce and evaluate several competitive baselines for conversational response selection, whose implementations are shared in the repository, as well as a neural encoder model that is trained on the entire training set.
- Anthology ID:
- W19-4101
- Volume:
- Proceedings of the First Workshop on NLP for Conversational AI
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Venue:
- ACL
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 1–10
- Language:
- URL:
- https://aclanthology.org/W19-4101
- DOI:
- 10.18653/v1/W19-4101
- Cite (ACL):
- Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. 2019. A Repository of Conversational Datasets. In Proceedings of the First Workshop on NLP for Conversational AI, pages 1–10, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- A Repository of Conversational Datasets (Henderson et al., ACL 2019)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/W19-4101.pdf
- Code:
- PolyAI-LDN/conversational-datasets + additional community code
- Data:
- Reddit Corpus, OpenSubtitles
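
The abstract names 1-of-100 accuracy as the evaluation metric without spelling it out. The following is a minimal sketch, assuming the metric is computed over random batches of 100 (context, response) pairs, with the other 99 responses in each batch acting as negative candidates. The function name `one_of_n_accuracy`, the `score` callable, and the toy data are hypothetical and are not taken from the repository, whose own implementation may differ.

```python
import random
from typing import Callable, Sequence, Tuple


def one_of_n_accuracy(
    examples: Sequence[Tuple[str, str]],
    score: Callable[[str, str], float],
    batch_size: int = 100,
    seed: int = 0,
) -> float:
    """Fraction of contexts whose own response gets the top score in its batch.

    Assumption: the negatives for each context are the responses of the
    other examples in the same randomly drawn batch.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)

    correct = 0
    total = 0
    # Score only full batches, so every context sees exactly batch_size candidates.
    for start in range(0, len(shuffled) - batch_size + 1, batch_size):
        batch = shuffled[start:start + batch_size]
        responses = [response for _, response in batch]
        for i, (context, _) in enumerate(batch):
            scores = [score(context, response) for response in responses]
            # Ties resolve to the lowest index, so a tied true response
            # only counts as correct if it comes first in the batch.
            best = max(range(batch_size), key=lambda j: scores[j])
            correct += int(best == i)
            total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy data: each context shares an id with its true response.
    toy = [(f"message {i}", f"reply {i}") for i in range(200)]

    def id_match(context: str, response: str) -> float:
        # Hypothetical scorer standing in for a trained response selection model.
        return 1.0 if context.split()[1] == response.split()[1] else 0.0

    print(one_of_n_accuracy(toy, id_match))
```

Any model that assigns a score to a (context, response) pair can be plugged in as `score`. Under this scheme a scorer with no signal lands near 1%, which is the reference point for reading reported numbers.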