Faithful Persona-based Conversational Dataset Generation with Large Language Models
Pegah Jandaghi, Xianghai Sheng, Xinyi Bai, Jay Pujara, Hakim Sidahmed
Abstract
High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user’s character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during an AI detection test decreases from 17.2% to 8.8% over three iterations.
- Anthology ID:
- 2024.nlp4convai-1.8
- Volume:
- Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Elnaz Nouri, Abhinav Rastogi, Georgios Spithourakis, Bing Liu, Yun-Nung Chen, Yu Li, Alon Albalak, Hiromi Wakaki, Alexandros Papangelis
- Venues:
- NLP4ConvAI | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 114–139
- URL:
- https://aclanthology.org/2024.nlp4convai-1.8
- Cite (ACL):
- Pegah Jandaghi, Xianghai Sheng, Xinyi Bai, Jay Pujara, and Hakim Sidahmed. 2024. Faithful Persona-based Conversational Dataset Generation with Large Language Models. In Proceedings of the 6th Workshop on NLP for Conversational AI (NLP4ConvAI 2024), pages 114–139, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- Faithful Persona-based Conversational Dataset Generation with Large Language Models (Jandaghi et al., NLP4ConvAI-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.nlp4convai-1.8.pdf
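
The Generator-Critic loop summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `critic_select` and `expand_dataset`, the toy generator and experts, averaging expert scores, and feeding the selected conversations back as prompt examples are all assumptions made for illustration. The abstract states only that experts select the best generated conversations and that these are used to improve the Generator.

```python
from typing import Callable, List, Sequence

Conversation = str
# Hypothetical interfaces: a generator maps (persona, examples) to candidates,
# and an expert maps (persona, conversation) to a quality score.
Generator = Callable[[str, Sequence[Conversation]], List[Conversation]]
Expert = Callable[[str, Conversation], float]


def critic_select(persona: str, candidates: Sequence[Conversation],
                  experts: Sequence[Expert]) -> Conversation:
    """Return the candidate with the highest mean expert score (assumed rule)."""
    def mean_score(conv: Conversation) -> float:
        return sum(expert(persona, conv) for expert in experts) / len(experts)
    return max(candidates, key=mean_score)


def expand_dataset(personas: Sequence[str], seed: Sequence[Conversation],
                   generator: Generator, experts: Sequence[Expert],
                   iterations: int = 3) -> List[Conversation]:
    """Run the generate-then-select loop for a fixed number of iterations."""
    examples = list(seed)
    dataset: List[Conversation] = []
    for _ in range(iterations):
        dataset = [
            critic_select(persona, generator(persona, examples), experts)
            for persona in personas
        ]
        # Assumption: the Generator is "improved" by reusing the selected
        # conversations as prompt examples in the next iteration.
        examples = list(seed) + dataset
    return dataset


# Toy stand-ins for the LLM-backed components (no model calls are made).
def toy_generator(persona: str, examples: Sequence[Conversation]) -> List[Conversation]:
    return [f"I like {persona}.", "Hello.", f"Hi! I like {persona} a lot, do you?"]


def faithfulness_expert(persona: str, conv: Conversation) -> float:
    return 1.0 if persona in conv else 0.0  # does the text reflect the persona?


def engagement_expert(persona: str, conv: Conversation) -> float:
    return 1.0 if "?" in conv else 0.0  # does the text invite a reply?


result = expand_dataset(["hiking", "jazz"], ["Hi there."], toy_generator,
                        [faithfulness_expert, engagement_expert])
print(result)
```

In a real setting the toy functions would be replaced by prompted LLM calls; the control flow is the only part this sketch is meant to convey.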