KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models

Dongjun Jang; Sungjoo Byun; Hyemi Jo; Hyopil Shin

Abstract

Instruction tuning of large language models (LLMs) is an essential process for a model to function well and achieve high performance on specific tasks. Accordingly, for mainstream languages such as English, instruction-based datasets are being constructed and made publicly available. In the case of Korean, publicly available models and datasets all rely on ChatGPT output or on translations of datasets built in English. In this paper, we introduce KIT-19, an instruction dataset for the development of LLMs in Korean. KIT-19 is created in an instruction format and comprises 19 existing open-source datasets for Korean NLP tasks. We train a Korean pretrained LLM on KIT-19 to demonstrate its effectiveness. The experimental results show that the model trained on KIT-19 significantly outperforms existing Korean LLMs. Based on its quality and empirical results, we propose that KIT-19 has the potential to make a substantial contribution to future improvements in Korean LLMs’ performance.

Anthology ID: 2024.lrec-main.853
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 9764–9776
URL: https://preview.aclanthology.org/fix-sig-urls/2024.lrec-main.853/
Cite (ACL): Dongjun Jang, Sungjoo Byun, Hyemi Jo, and Hyopil Shin. 2024. KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9764–9776, Torino, Italia. ELRA and ICCL.
Cite (Informal): KIT-19: A Comprehensive Korean Instruction Toolkit on 19 Tasks for Fine-Tuning Korean Large Language Models (Jang et al., LREC-COLING 2024)
PDF: https://preview.aclanthology.org/fix-sig-urls/2024.lrec-main.853.pdf
