TeleChat: An Open-source Bilingual Large Language Model
Zihan Wang, Liuxz2@chinatelecom.cn, Liusx14@chinatelecom.cn, Yitong Yao, Huangyy121@chinatelecom.cn, Li Mengxiang, Zhongjiang He, Liyx25@chinatelecom.cn, Pulw@chinatelecom.cn, Xuhn@chinatelecom.cn, Chao Wang, Shuangyong Song
Abstract
In this paper, we present TeleChat, a collection of large language models (LLMs) with 7 billion and 12 billion parameters. TeleChat is initially pretrained on an extensive corpus of diverse English and Chinese texts encompassing trillions of tokens. The model is subsequently fine-tuned to align with human preferences, following a detailed methodology that we describe. We evaluate the performance of TeleChat on various tasks, including general dialogue generation, language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves state-of-the-art performance among open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat-7B and TeleChat-12B, along with code and a portion of our filtered high-quality pretraining data, to the public community.
- Anthology ID:
- 2024.sighan-1.2
- Volume:
- Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, Runcong Zhao
- Venues:
- SIGHAN | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10–20
- URL:
- https://aclanthology.org/2024.sighan-1.2
- Cite (ACL):
- Zihan Wang, Liuxz2@chinatelecom.cn, Liusx14@chinatelecom.cn, Yitong Yao, Huangyy121@chinatelecom.cn, Li Mengxiang, Zhongjiang He, Liyx25@chinatelecom.cn, Pulw@chinatelecom.cn, Xuhn@chinatelecom.cn, Chao Wang, and Shuangyong Song. 2024. TeleChat: An Open-source Bilingual Large Language Model. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 10–20, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- TeleChat: An Open-source Bilingual Large Language Model (Wang et al., SIGHAN-WS 2024)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2024.sighan-1.2.pdf