An Empirical Study of Instruction-tuning Large Language Models in Chinese
Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, Weiping Wang
Abstract
The success of ChatGPT validates the potential of large language models (LLMs) in artificial general intelligence (AGI). Subsequently, the release of LLMs has sparked the open-source community's interest in instruction-tuning, which is believed to accelerate the replication of ChatGPT. However, research on instruction-tuning LLMs in Chinese, the most widely spoken language in the world, is still in its early stages. Therefore, this paper presents an in-depth empirical study of instruction-tuning LLMs in Chinese, which can serve as a cookbook providing valuable findings for effectively customizing LLMs that better respond to Chinese instructions. Specifically, we systematically explore the impact of LLM bases, parameter-efficient methods, and instruction data types, which are the three most important elements for instruction-tuning. In addition, we conduct experiments to study the impact of other factors, e.g., chain-of-thought data and human-value alignment. We hope that this empirical study can make a modest contribution to an open Chinese version of ChatGPT. This paper releases a powerful Chinese LLM that is comparable to ChatGLM. The code and data are available at https://github.com/PhoebusSi/Alpaca-CoT.
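As a rough illustration of what the abstract's three elements (an LLM base, a parameter-efficient method, and instruction data) look like in practice, the sketch below fine-tunes a base model with LoRA adapters on Alpaca-style Chinese instruction data using Hugging Face transformers and peft. The base-model name, prompt template, data file, and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch (assumptions noted inline): LoRA instruction-tuning on Chinese data.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base = "bigscience/bloomz-7b1"  # assumed Chinese-capable base; could be LLaMA, ChatGLM, etc.
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Parameter-efficient method: wrap the base model with low-rank adapters so only
# a small number of added parameters are trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Instruction data in Alpaca-style JSON records: {"instruction", "input", "output"}.
# "chinese_instructions.json" is a placeholder file name.
data = load_dataset("json", data_files="chinese_instructions.json")["train"]

def format_example(ex):
    # "指令/输入/回答" = "instruction/input/answer"; an assumed Chinese prompt template.
    prompt = f"指令：{ex['instruction']}\n输入：{ex.get('input', '')}\n回答：{ex['output']}"
    return tokenizer(prompt, truncation=True, max_length=512)

data = data.map(format_example, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=4,
                           num_train_epochs=3, learning_rate=2e-4),
    # mlm=False gives standard causal-LM labels (labels = input_ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Swapping the base checkpoint, the adapter configuration, or the instruction file corresponds to varying the three elements the paper studies.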
- Anthology ID: 2023.findings-emnlp.269
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Houda Bouamor, Juan Pino, Kalika Bali
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 4086–4107
- URL: https://aclanthology.org/2023.findings-emnlp.269
- DOI: 10.18653/v1/2023.findings-emnlp.269
- Cite (ACL): Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, and Weiping Wang. 2023. An Empirical Study of Instruction-tuning Large Language Models in Chinese. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4086–4107, Singapore. Association for Computational Linguistics.
- Cite (Informal): An Empirical Study of Instruction-tuning Large Language Models in Chinese (Si et al., Findings 2023)
- PDF: https://preview.aclanthology.org/emnlp22-frontmatter/2023.findings-emnlp.269.pdf