Abstract
The advent of instruction-tuned large language models (LLMs) has significantly advanced automatic instruction dataset augmentation. However, generating instructions and outputs from an LLM's inherent knowledge can unintentionally produce hallucinations, i.e., factually incorrect or misleading information. To overcome this, we propose SELF-EXPERTISE, which automatically generates an instruction dataset in the legal domain from a seed dataset. SELF-EXPERTISE extracts knowledge from the outputs of the seed dataset and generates new instructions, inputs, and outputs, thereby reducing hallucination in automatic instruction augmentation. We trained the LLaMA-2 7B model on a SELF-EXPERTISE-augmented instruction dataset to construct a Korean legal specialized model, called LxPERT. LxPERT demonstrates performance surpassing GPT-3.5-turbo on both in-domain and out-of-domain datasets. The SELF-EXPERTISE augmentation pipeline is not only applicable to the legal field but is also expected to extend to other domains, potentially advancing domain-specialized LLMs.
- Anthology ID: 2024.findings-naacl.69
- Volume: Findings of the Association for Computational Linguistics: NAACL 2024
- Month: June
- Year: 2024
- Address: Mexico City, Mexico
- Editors: Kevin Duh, Helena Gomez, Steven Bethard
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 1098–1112
- URL: https://aclanthology.org/2024.findings-naacl.69
- Cite (ACL): Minju Kim, Haein Jung, and Myoung-Wan Koo. 2024. SELF-EXPERTISE: Knowledge-based Instruction Dataset Augmentation for a Legal Expert Language Model. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1098–1112, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal): SELF-EXPERTISE: Knowledge-based Instruction Dataset Augmentation for a Legal Expert Language Model (Kim et al., Findings 2024)
- PDF: https://preview.aclanthology.org/ingestion-checklist/2024.findings-naacl.69.pdf