Abstract
Collecting high-quality question-answer (QA) pairs is vital for training large language models (LLMs), yet the process is traditionally laborious and time-intensive. With the rapid evolution of LLMs, the potential for leveraging these models to autonomously generate QA pairs has become apparent, particularly through large-scale models such as GPT-4. However, the computational demands and associated costs often render such approaches prohibitive for the average researcher. Addressing this gap, we introduce the Collaborative Small Language Model framework (CSLM), an innovative solution that combines a group of small-scale, open-source LLMs to collaboratively produce QA pairs. Experiments on datasets from various domains show that CSLM unleashes the full potential of diverse small models to generate high-quality QA pairs, making this process accessible to a broader range of researchers.
- Anthology ID: 2024.findings-emnlp.690
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
- Month: November
- Year: 2024
- Address: Miami, Florida, USA
- Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 11816–11825
- URL: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.690/
- DOI: 10.18653/v1/2024.findings-emnlp.690
- Cite (ACL): Yiming Wang, Yang Liu, Lingchen Wang, and An Xiao. 2024. CSLM: A Framework for Question Answering Dataset Generation through Collaborative Small Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11816–11825, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal): CSLM: A Framework for Question Answering Dataset Generation through Collaborative Small Language Models (Wang et al., Findings 2024)
- PDF: https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.690.pdf