More Diverse Dialogue Datasets via Diversity-Informed Data Collection

Katherine Stasaski, Grace Hui Yang, Marti A. Hearst


Abstract
Automated generation of conversational dialogue using modern neural architectures has made notable advances. However, these models are known to often produce uninteresting, predictable responses, a drawback referred to as the diversity problem. We introduce a new strategy to address this problem, called Diversity-Informed Data Collection. Unlike prior approaches, which modify model architectures to solve the problem, this method uses dynamically computed corpus-level statistics to determine which conversational participants to collect data from. Diversity-Informed Data Collection produces significantly more diverse data than baseline data collection methods, and better results on two downstream tasks: emotion classification and dialogue generation. This method is generalizable and can be used with other corpus-level metrics.
Anthology ID:
2020.acl-main.446
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
4958–4968
URL:
https://aclanthology.org/2020.acl-main.446
DOI:
10.18653/v1/2020.acl-main.446
Cite (ACL):
Katherine Stasaski, Grace Hui Yang, and Marti A. Hearst. 2020. More Diverse Dialogue Datasets via Diversity-Informed Data Collection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4958–4968, Online. Association for Computational Linguistics.
Cite (Informal):
More Diverse Dialogue Datasets via Diversity-Informed Data Collection (Stasaski et al., ACL 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.acl-main.446.pdf
Video:
http://slideslive.com/38929100