A Repository of Conversational Datasets

Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, Tsung-Hsien Wen


Abstract
Progress in machine learning is often driven by the availability of large datasets and of consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, and a standardised evaluation procedure for conversational response selection models using 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering steps to their needs. We introduce and evaluate several competitive baselines for conversational response selection, whose implementations are shared in the repository, as well as a neural encoder model that is trained on the entire training set.
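As an illustration of the 1-of-100 accuracy metric named in the abstract, the following is a minimal sketch, not the repository's implementation. It assumes the metric is computed per batch of N context/response pairs (N = 100 in the paper's setting), with the other N - 1 responses in the batch acting as distractors for each context. The function name `one_of_n_accuracy` and the score-matrix layout are hypothetical choices made for this example.

```python
from typing import Sequence


def one_of_n_accuracy(scores: Sequence[Sequence[float]]) -> float:
    """Fraction of contexts whose true response is ranked first.

    Assumption: scores[i][j] is the model's score for pairing context i
    with response j, and the true response for context i is response i.
    With a 100 x 100 matrix this corresponds to 1-of-100 accuracy.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must contain at least one row")

    correct = 0
    for i, row in enumerate(scores):
        if len(row) != n:
            raise ValueError("scores must be a square matrix")
        # Count a hit only when the true response strictly outscores
        # every distractor in the same batch (ties count as misses).
        if all(row[i] > s for j, s in enumerate(row) if j != i):
            correct += 1
    return correct / n


# Toy batch of 3 instead of 100, purely for readability.
toy_scores = [
    [0.9, 0.1, 0.2],  # context 0: true response scores highest
    [0.3, 0.2, 0.8],  # context 1: a distractor scores highest
    [0.1, 0.4, 0.7],  # context 2: true response scores highest
]
toy_accuracy = one_of_n_accuracy(toy_scores)
```

Under these assumptions, a full evaluation would average this value over many batches of 100 examples drawn from the test set.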
Anthology ID:
W19-4101
Volume:
Proceedings of the First Workshop on NLP for Conversational AI
Month:
August
Year:
2019
Address:
Florence, Italy
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1–10
URL:
https://aclanthology.org/W19-4101
DOI:
10.18653/v1/W19-4101
Cite (ACL):
Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. 2019. A Repository of Conversational Datasets. In Proceedings of the First Workshop on NLP for Conversational AI, pages 1–10, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
A Repository of Conversational Datasets (Henderson et al., ACL 2019)
PDF:
https://preview.aclanthology.org/ingestion-script-update/W19-4101.pdf
Code
PolyAI-LDN/conversational-datasets
Data
Reddit Corpus, OpenSubtitles