Abstract
Automated generation of conversational dialogue using modern neural architectures has made notable advances. However, these models are known to have a drawback of often producing uninteresting, predictable responses; this is known as the diversity problem. We introduce a new strategy to address this problem, called Diversity-Informed Data Collection. Unlike prior approaches, which modify model architectures to solve the problem, this method uses dynamically computed corpus-level statistics to determine which conversational participants to collect data from. Diversity-Informed Data Collection produces significantly more diverse data than baseline data collection methods, and better results on two downstream tasks: emotion classification and dialogue generation. This method is generalizable and can be used with other corpus-level metrics.
- Anthology ID:
- 2020.acl-main.446
- Volume:
- Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
- Month:
- July
- Year:
- 2020
- Address:
- Online
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 4958–4968
- URL:
- https://aclanthology.org/2020.acl-main.446
- DOI:
- 10.18653/v1/2020.acl-main.446
- Cite (ACL):
- Katherine Stasaski, Grace Hui Yang, and Marti A. Hearst. 2020. More Diverse Dialogue Datasets via Diversity-Informed Data Collection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4958–4968, Online. Association for Computational Linguistics.
- Cite (Informal):
- More Diverse Dialogue Datasets via Diversity-Informed Data Collection (Stasaski et al., ACL 2020)
- PDF:
- https://preview.aclanthology.org/ingestion-script-update/2020.acl-main.446.pdf
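
The abstract describes using a corpus-level statistic to decide which participants to keep collecting data from. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes distinct-n (the share of unique n-grams) as the corpus-level metric, and the function names `distinct_n` and `select_participants`, the whitespace tokenization, and the leave-one-out comparison are all hypothetical choices made for illustration.

```python
def distinct_n(utterances, n=2):
    """Share of n-grams that are unique across a list of utterances."""
    total = 0
    seen = set()
    for utterance in utterances:
        tokens = utterance.lower().split()  # naive whitespace tokenizer (assumption)
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0


def select_participants(contributions, n=2):
    """Return ids of participants whose data raises the corpus-level score.

    contributions maps a participant id to that participant's utterances.
    A participant is kept when the corpus scores lower without them
    (a hypothetical leave-one-out rule, recomputed for each participant).
    """
    corpus = [u for utts in contributions.values() for u in utts]
    base = distinct_n(corpus, n)
    keep = []
    for pid in contributions:
        rest = [u for other, utts in contributions.items()
                if other != pid for u in utts]
        if distinct_n(rest, n) < base:
            keep.append(pid)
    return keep


contributions = {
    "a": ["i am fine thanks", "i am fine thanks"],
    "b": ["the ferry leaves at dawn", "my cat chased a drone"],
}
print(select_participants(contributions))
```

Because the score is recomputed from the current corpus on every call, the same function can be rerun as new conversations arrive, and `distinct_n` can be swapped for another corpus-level metric.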