Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ning Ding; Yulin Chen; Bokai Xu; Yujia Qin; Shengding Hu; Zhiyuan Liu; Maosong Sun; Bowen Zhou

doi:10.18653/v1/2023.emnlp-main.183

Abstract

Fine-tuning on instruction data has been widely validated as an effective practice for implementing chat language models like ChatGPT. Scaling the diversity and quality of such data, although straightforward, stands a great chance of leading to improved performance. This paper aims to push the upper bound of open-source models further. We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat, which does not involve human queries. Our objective is to capture the breadth of interactions between a human user and an AI assistant, and we employ a comprehensive framework to generate multi-turn conversations iteratively. UltraChat contains 1.5 million high-quality multi-turn dialogues and covers a wide range of topics and instructions. Our statistical analysis of UltraChat reveals its superiority in various key metrics, including scale, average length, diversity, and coherence, solidifying its position as a leading open-source dataset. Building upon UltraChat, we fine-tune a LLaMA model to create a powerful conversational model, UltraLM. Our evaluations indicate that UltraLM consistently outperforms other open-source models, including WizardLM and Vicuna, the previously recognized state-of-the-art open-source models.

Anthology ID: 2023.emnlp-main.183
Volume: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3029–3051
URL: https://aclanthology.org/2023.emnlp-main.183
DOI: 10.18653/v1/2023.emnlp-main.183
Cite (ACL): Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. 2023. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore. Association for Computational Linguistics.
Cite (Informal): Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Ding et al., EMNLP 2023)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2023.emnlp-main.183.pdf
Video: https://preview.aclanthology.org/nschneid-patch-1/2023.emnlp-main.183.mp4
