Self-supervised Data Augmentation for Text Classification in Low-Data Settings

Deyu Ding, Mengying Wang, Andreas Spitz


Abstract
Due to data sparsity and high annotation cost, data augmentation has established itself as an effective tool for boosting model performance on supervised NLP tasks. Whereas task-agnostic augmentation methods tend to act as simple regularizers for the data, task-aware methods also leverage labels to generate data that are most suitable for downstream tasks. While prior work has investigated generation and sampling strategies individually, the potential of a self-supervised approach that leverages multiple pre-trained models in both generation and sampling remains underexplored. To address this gap, we present an ensemble-based framework of language models that proposes augmentation candidates and internally reviews their suitability for low-resource text classification tasks. We evaluate our model on six classification benchmarks and find that it consistently outperforms state-of-the-art data augmentation baselines, improving classification accuracy by an average of 0.97 points in low-data scenarios.
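The abstract describes the propose-then-review loop only at a high level; the paper's actual generators, reviewers, and suitability criteria are not given here. Below is a minimal, hypothetical Python sketch of that general pattern, assuming stand-in generators (toy synonym substitution and word dropout) and a stand-in internal reviewer (label agreement among ensemble predictors). All names and components in the sketch are illustrative assumptions, not the authors' method.

import random

# Hypothetical stand-in generators: in the paper these would be
# pre-trained language models proposing augmentation candidates.
SYNONYMS = {"good": ["great", "fine"], "bad": ["poor", "awful"]}

def synonym_generator(text):
    """Propose candidates by swapping in synonyms for known words."""
    words = text.split()
    out = []
    for i, w in enumerate(words):
        for s in SYNONYMS.get(w.lower(), []):
            out.append(" ".join(words[:i] + [s] + words[i + 1:]))
    return out

def dropout_generator(text):
    """Propose a candidate by deleting one random word."""
    words = text.split()
    if len(words) < 2:
        return []
    i = random.randrange(len(words))
    return [" ".join(words[:i] + words[i + 1:])]

def review(candidate, label, predictors, threshold=1.0):
    """Internal review: keep a candidate only if enough ensemble
    members agree it still carries the original label."""
    votes = sum(1 for p in predictors if p(candidate) == label)
    return votes / len(predictors) >= threshold

def augment(example, label, generators, predictors):
    """Generate candidates with all generators, then filter by review."""
    candidates = [c for g in generators for c in g(example)]
    return [c for c in candidates if review(c, label, predictors)]

if __name__ == "__main__":
    # Toy predictors standing in for pre-trained classifiers.
    predictors = [
        lambda t: "pos" if any(w in t for w in ("great", "good", "fine")) else "neg",
        lambda t: "neg" if any(w in t for w in ("bad", "poor", "awful")) else "pos",
    ]
    kept = augment("the movie was good", "pos",
                   [synonym_generator, dropout_generator], predictors)
    print(kept)

The unanimity threshold above is the simplest possible internal review; the paper's actual suitability criterion for candidates may differ.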
Anthology ID:
2026.lrec-main.788
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
10046–10056
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.788/
Cite (ACL):
Deyu Ding, Mengying Wang, and Andreas Spitz. 2026. Self-supervised Data Augmentation for Text Classification in Low-Data Settings. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 10046–10056, Palma de Mallorca, Spain. ELRA Language Resources Association.
Cite (Informal):
Self-supervised Data Augmentation for Text Classification in Low-Data Settings (Ding et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.788.pdf