CSLM: A Framework for Question Answering Dataset Generation through Collaborative Small Language Models

Yiming Wang, Yang Liu, Lingchen Wang, An Xiao


Abstract
Collecting high-quality question-answer (QA) pairs is vital for the training of large language models (LLMs), yet this process is traditionally laborious and time-intensive. With the rapid evolution of LLMs, the potential for leveraging these models to autonomously generate QA pairs has become apparent, particularly through the use of large-scale models like GPT-4. However, the computational demands and associated costs often render such approaches prohibitive for the average researcher. Addressing this gap, we introduce the Collaborative Small Language Model Framework (CSLM), an innovative solution that combines a group of small-scale, open-source LLMs to collaboratively produce QA pairs. Experiments on datasets of various domains show that CSLM unleashes the full potential of diverse small models to generate high-quality QA pairs, making QA dataset generation accessible to a broader range of researchers.
Anthology ID:
2024.findings-emnlp.690
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11816–11825
URL:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.690/
DOI:
10.18653/v1/2024.findings-emnlp.690
Cite (ACL):
Yiming Wang, Yang Liu, Lingchen Wang, and An Xiao. 2024. CSLM: A Framework for Question Answering Dataset Generation through Collaborative Small Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11816–11825, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
CSLM: A Framework for Question Answering Dataset Generation through Collaborative Small Language Models (Wang et al., Findings 2024)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2024.findings-emnlp.690.pdf