Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification

Aleksandra Edwards, Asahi Ushio, Jose Camacho-collados, Helene Ribaupierre, Alun Preece


Abstract
Data augmentation techniques are widely used for enhancing the performance of machine learning models by tackling class imbalance issues and data sparsity. State-of-the-art generative language models have been shown to provide significant gains across different NLP tasks. However, their applicability to data augmentation for text classification tasks in few-shot settings has not been fully explored, especially for specialised domains. In this paper, we leverage GPT-2 (Radford et al., 2019) for generating artificial training instances in order to improve classification performance. Our aim is to analyse the impact that the selection of seed training examples has on the quality of GPT-generated samples and, consequently, on classifier performance. We propose a human-in-the-loop approach for selecting seed samples. Further, we compare this approach to other seed selection strategies that exploit the characteristics of specialised domains, such as human-created class hierarchical structures and the presence of noun phrases. Our results show that fine-tuning GPT-2 on a handful of labelled instances leads to consistent classification improvements and outperforms competitive baselines. The seed selection strategies developed in this work lead to significant improvements over random seed selection for specialised domains. We show that guiding text generation through domain expert selection can lead to further improvements, which opens up interesting research avenues for combining generative models and active learning.
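The contrast between random and guided seed selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy examples, and the term-count heuristic (a crude stand-in for the noun-phrase, class-hierarchy, or expert-based strategies) are all assumptions made for illustration. The selected seeds would then be the texts used to fine-tune or prompt a generator such as GPT-2, a step the sketch leaves out.

```python
import random
from collections import defaultdict


def select_seeds(examples, k, score=None, seed=0):
    """Pick up to k seed texts per label (hypothetical helper).

    examples: list of (text, label) pairs.
    score: optional function text -> number; higher-scoring texts are
           preferred. With score=None, seeds are drawn at random.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    rng = random.Random(seed)
    seeds = {}
    for label, texts in by_label.items():
        if score is None:
            # Baseline: random seed selection within each class.
            seeds[label] = rng.sample(texts, min(k, len(texts)))
        else:
            # Guided selection: keep the k highest-scoring texts.
            # sorted() is stable, so ties keep their original order.
            seeds[label] = sorted(texts, key=score, reverse=True)[:k]
    return seeds


def domain_term_score(text, domain_terms):
    """Assumed stand-in for noun-phrase scoring: count domain terms."""
    tokens = text.lower().split()
    return sum(token.strip(".,;:") in domain_terms for token in tokens)


# Toy data, invented purely for illustration.
examples = [
    ("The patient reported chest pain and nausea.", "cardiology"),
    ("Follow-up visit, nothing new to report.", "cardiology"),
    ("MRI shows a lesion in the left temporal lobe.", "neurology"),
    ("Appointment rescheduled to next week.", "neurology"),
]
terms = {"chest", "pain", "mri", "lesion", "lobe"}

random_seeds = select_seeds(examples, k=1)
guided_seeds = select_seeds(
    examples, k=1, score=lambda text: domain_term_score(text, terms)
)
```

With the toy data above, the guided variant keeps the domain-heavy sentence in each class, whereas the random baseline may pick either one. A human-in-the-loop variant could replace `score` with judgements collected from a domain expert.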
Anthology ID:
2022.dash-1.8
Volume:
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Eduard Dragut, Yunyao Li, Lucian Popa, Slobodan Vucetic, Shashank Srivastava
Venue:
DaSH
Publisher:
Association for Computational Linguistics
Pages:
51–63
URL:
https://aclanthology.org/2022.dash-1.8
Cite (ACL):
Aleksandra Edwards, Asahi Ushio, Jose Camacho-collados, Helene Ribaupierre, and Alun Preece. 2022. Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification. In Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances), pages 51–63, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification (Edwards et al., DaSH 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.dash-1.8.pdf
Software:
 2022.dash-1.8.software.zip
Dataset:
 2022.dash-1.8.dataset.zip
Video:
 https://preview.aclanthology.org/emnlp-22-attachments/2022.dash-1.8.mp4