TeleChat: An Open-source Billingual Large Language Model

Zihan Wang, XinZhang Liu, Shixuan Liu, Yitong Yao, Yunyao Huang, Mengxiang Li, Zhongjiang He, Yongxian Li, Luwen Pu, Huinan Xu, Chao Wang, Shuangyong Song


Abstract
In this paper, we present TeleChat, a collection of large language models (LLMs) with 7 billion and 12 billion parameters. TeleChat is first pretrained on an extensive corpus of diverse English and Chinese texts comprising trillions of tokens. The model is then fine-tuned to align with human preferences, following a detailed methodology that we describe. We evaluate TeleChat on a range of tasks, including general dialogue generation, language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves state-of-the-art performance relative to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications of LLMs, we release the fine-tuned model checkpoints of TeleChat-7B and TeleChat-12B, along with code and a portion of our filtered high-quality pretraining data, to the public community.
Anthology ID:
2024.sighan-1.2
Volume:
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, Runcong Zhao
Venues:
SIGHAN | WS
Publisher:
Association for Computational Linguistics
Pages:
10–20
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2024.sighan-1.2/
Cite (ACL):
Zihan Wang, XinZhang Liu, Shixuan Liu, Yitong Yao, Yunyao Huang, Mengxiang Li, Zhongjiang He, Yongxian Li, Luwen Pu, Huinan Xu, Chao Wang, and Shuangyong Song. 2024. TeleChat: An Open-source Billingual Large Language Model. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 10–20, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
TeleChat: An Open-source Billingual Large Language Model (Wang et al., SIGHAN 2024)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2024.sighan-1.2.pdf