Mengying Wang
2026
Self-supervised Data Augmentation for Text Classification in Low-Data Settings
Deyu Ding | Mengying Wang | Andreas Spitz
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Due to data sparsity and high annotation cost, data augmentation has established itself as an effective tool for boosting model performance on supervised NLP tasks. Where task-agnostic augmentation methods tend to act as simple regularizers for the data, task-aware methods also leverage labels for the generation of data that are most suitable for downstream tasks. While prior work has investigated generation and sampling strategies individually, the potential of a self-supervised approach that leverages multiple pre-trained models in generation and sampling remains underexplored. To address this issue, we present an ensemble-based framework of language models that proposes augmentation candidates and internally reviews their suitability for low-resource text classification tasks. We evaluate our model on six classification benchmarks and find that it consistently outperforms state-of-the-art data augmentation baselines in classification accuracy by an average of 0.97 points in low-data scenarios.
2025
Quantifying the Risks of LLM- and Tool-assisted Rephrasing to Linguistic Diversity
Mengying Wang | Andreas Spitz
Findings of the Association for Computational Linguistics: EMNLP 2025
Writing assistants and large language models see widespread use in the creation of text content. While their effectiveness for individual users has been evaluated in the literature, little is known about their proclivity to change language or reduce its richness when adopted by a large user base. In this paper, we take a first step towards quantifying this risk by measuring the semantic and vocabulary change enacted by the use of rephrasing tools on a multi-domain corpus of human-generated text.