Data Augmentation for Multiclass Utterance Classification – A Systematic Study
Binxia Xu | Siyuan Qiu | Jie Zhang | Yafang Wang | Xiaoyu Shen | Gerard de Melo
Proceedings of the 28th International Conference on Computational Linguistics
Utterance classification is a key component in many conversational systems. However, classifying real-world user utterances is challenging, as people may express their ideas and thoughts in manifold ways, and the amount of training data for some categories may be fairly limited, resulting in imbalanced data distributions. To alleviate these issues, we conduct a comprehensive survey of data augmentation approaches for text classification, including simple random resampling, word-level transformations, and neural text generation, as ways to cope with imbalanced data. Our experiments focus on multi-class datasets with a large number of data samples, a setting that has not been systematically studied in previous work. The results show that the effectiveness of different data augmentation schemes depends on the nature of the dataset under consideration.
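
To make the simplest of the approaches named above concrete, the following is a minimal sketch of random resampling for an imbalanced multi-class dataset. It is not the authors' implementation: the function name `random_oversample`, its parameters, and the choice to oversample every class up to the size of the largest class are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def random_oversample(utterances, labels, seed=0):
    """Duplicate randomly chosen utterances of smaller classes until every
    class has as many examples as the largest class (illustrative only)."""
    rng = random.Random(seed)

    # Group utterances by their class label.
    by_label = defaultdict(list)
    for text, label in zip(utterances, labels):
        by_label[label].append(text)

    # Assumption: the target size is the size of the largest class.
    target = max(len(texts) for texts in by_label.values())

    out_texts, out_labels = [], []
    for label, texts in by_label.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(texts) for _ in range(target - len(texts))]
        for text in texts + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


if __name__ == "__main__":
    texts = ["book a flight", "cancel my flight", "change my flight", "reset password"]
    labels = ["travel", "travel", "travel", "account"]
    new_texts, new_labels = random_oversample(texts, labels)
    print(list(zip(new_texts, new_labels)))
```

Resampling of this kind only repeats existing utterances, so it adds no new surface forms; word-level transformations and neural text generation, the other two families mentioned in the abstract, aim to introduce such variation.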