A Repository of Conversational Datasets

Matthew Henderson; Paweł Budzianowski; Iñigo Casanueva; Sam Coope; Daniela Gerz; Girish Kumar; Nikola Mrkšić; Georgios Spithourakis; Pei-Hao Su; Ivan Vulić; Tsung-Hsien Wen

doi:10.18653/v1/W19-4101

Abstract

Progress in machine learning is often driven by the availability of large datasets and of consistent evaluation metrics for comparing modeling approaches. To this end, we present a repository of conversational datasets consisting of hundreds of millions of examples, together with a standardised evaluation procedure for conversational response selection models based on 1-of-100 accuracy. The repository contains scripts that allow researchers to reproduce the standard datasets, or to adapt the pre-processing and data filtering steps to their needs. We introduce and evaluate several competitive baselines for conversational response selection, whose implementations are shared in the repository, as well as a neural encoder model that is trained on the entire training set.
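The 1-of-100 accuracy mentioned in the abstract can be illustrated with a minimal sketch. It assumes the metric is computed over a batch of N context/response pairs (N = 100 in the standard setting), where each context's true response is ranked against the other responses in the batch as negatives. The function name `one_of_n_accuracy` and the score-matrix layout are hypothetical choices for illustration, not the repository's own API.

```python
from typing import Sequence


def one_of_n_accuracy(scores: Sequence[Sequence[float]]) -> float:
    """Fraction of contexts whose true response gets the top score.

    Assumption: scores[i][j] is the model's score for pairing context i
    with response j, and response i is the true response for context i.
    The remaining N - 1 responses in the batch act as negatives.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must contain at least one row")

    correct = 0
    for i, row in enumerate(scores):
        if len(row) != n:
            raise ValueError("scores must be a square N x N matrix")
        # Index of the highest-scoring candidate; on ties the lowest
        # index wins, which is an arbitrary choice made for this sketch.
        best = max(range(n), key=lambda j: row[j])
        if best == i:
            correct += 1
    return correct / n


# Toy batch of 3 (a real evaluation batch would hold 100 examples).
toy_scores = [
    [0.9, 0.1, 0.0],  # context 0: true response ranked first
    [0.2, 0.8, 0.1],  # context 1: true response ranked first
    [0.7, 0.2, 0.1],  # context 2: response 0 outranks the true response
]
toy_accuracy = one_of_n_accuracy(toy_scores)  # 2 of 3 contexts correct
```

With batches of 100, a model that scores candidates at random would be expected to reach about 1% under this definition, which is why the metric leaves plenty of room to separate strong models from weak ones.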

Anthology ID: W19-4101
Volume: Proceedings of the First Workshop on NLP for Conversational AI
Month: August
Year: 2019
Address: Florence, Italy
Editors: Yun-Nung Chen, Tania Bedrax-Weiss, Dilek Hakkani-Tur, Anuj Kumar, Mike Lewis, Thang-Minh Luong, Pei-Hao Su, Tsung-Hsien Wen
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1–10
URL: https://aclanthology.org/W19-4101
DOI: 10.18653/v1/W19-4101
Cite (ACL): Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, and Tsung-Hsien Wen. 2019. A Repository of Conversational Datasets. In Proceedings of the First Workshop on NLP for Conversational AI, pages 1–10, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): A Repository of Conversational Datasets (Henderson et al., ACL 2019)
PDF: https://preview.aclanthology.org/emnlp-22-attachments/W19-4101.pdf
Code: PolyAI-LDN/conversational-datasets + additional community code
Data: Reddit Corpus, OpenSubtitles, Reddit
