@inproceedings{schmidt-etal-2024-prompting,
title = "Prompting-based Synthetic Data Generation for Few-Shot Question Answering",
author = "Schmidt, Maximilian and
Bartezzaghi, Andrea and
Vu, Ngoc Thang",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://preview.aclanthology.org/fix-sig-urls/2024.lrec-main.1153/",
pages = "13168--13178",
abstract = "Although language models (LMs) have boosted the performance of Question Answering, they still need plenty of data. Data annotation, in contrast, is a time-consuming process. This especially applies to Question Answering, where possibly large documents have to be parsed and annotated with questions and their corresponding answers. Furthermore, Question Answering models often only work well for the domain they were trained on. Since annotation is costly, we argue that domain-agnostic knowledge from LMs, such as linguistic understanding, is sufficient to create a well-curated dataset. With this motivation, we show that using large language models can improve Question Answering performance on various datasets in the few-shot setting compared to state-of-the-art approaches. For this, we perform data generation leveraging the Prompting framework, suggesting that language models contain valuable task-agnostic knowledge that can be used beyond the common pre-training/fine-tuning scheme. As a result, we consistently outperform previous approaches on few-shot Question Answering."
}