TeleChat: An Open-source Bilingual Large Language Model

Zihan Wang, Xinzhang Liu, Shixuan Liu, Yitong Yao, Yuyao Huang, Mengxiang Li, Zhongjiang He, Yongxiang Li, Luwen Pu, Huihan Xu, Chao Wang, Shuangyong Song


Abstract
In this paper, we present TeleChat, a collection of large language models (LLMs) with 7 billion and 12 billion parameters. TeleChat is first pretrained on an extensive corpus of diverse English and Chinese texts encompassing trillions of tokens. The model is then fine-tuned to align with human preferences, following a detailed methodology that we describe. We evaluate TeleChat on a variety of tasks, including general dialogue generation, language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves performance comparable to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications of LLMs, we release the fine-tuned model checkpoints of TeleChat-7B and TeleChat-12B, along with code and a portion of our filtered, high-quality pretraining data, to the public community.
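The released checkpoints can be loaded with standard open-source tooling. Below is a minimal sketch using the Hugging Face transformers library; the repository id "Tele-AI/telechat-7B", the need for trust_remote_code, and the generation settings are illustrative assumptions, not details specified on this page — consult the official release for exact usage.

```python
# Minimal sketch: load a released TeleChat checkpoint and generate a reply.
# The repo id below is an assumption for illustration; check the official release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Tele-AI/telechat-7B"  # hypothetical Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # spread weights across available GPUs/CPU
    trust_remote_code=True,   # assumed: the release may ship custom modeling code
)

# Encode a (Chinese) prompt and sample a continuation from the bilingual model.
prompt = "请简要介绍一下大语言模型。"  # "Briefly introduce large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```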
Anthology ID:
2024.sighan-1.2
Volume:
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, Runcong Zhao
Venues:
SIGHAN | WS
Publisher:
Association for Computational Linguistics
Pages:
10–20
URL:
https://aclanthology.org/2024.sighan-1.2
Cite (ACL):
Zihan Wang, Xinzhang Liu, Shixuan Liu, Yitong Yao, Yuyao Huang, Mengxiang Li, Zhongjiang He, Yongxiang Li, Luwen Pu, Huihan Xu, Chao Wang, and Shuangyong Song. 2024. TeleChat: An Open-source Bilingual Large Language Model. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 10–20, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
TeleChat: An Open-source Bilingual Large Language Model (Wang et al., SIGHAN-WS 2024)
PDF:
https://aclanthology.org/2024.sighan-1.2.pdf