NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset
Sana Al-Azzawi, György Kovács, Filip Nilsson, Tosin Adewumi, Marcus Liwicki
Abstract
In this paper, we propose a methodology for task 10 of SemEval23, focusing on detecting and classifying online sexism in social media posts. The task tackles a serious issue, as detecting harmful content on social media platforms is crucial for mitigating the harm these posts cause to users. Our solution for this task is based on an ensemble of fine-tuned transformer-based models (BERTweet, RoBERTa, and DeBERTa). To alleviate problems related to class imbalance, and to improve the generalization capability of our model, we also experiment with data augmentation and semi-supervised learning. In particular, for data augmentation, we use back-translation, either on all classes or on the underrepresented classes only. We analyze the impact of these strategies on the overall performance of the pipeline through extensive experiments. For semi-supervised learning, we found that with a substantial amount of unlabelled, in-domain data available, semi-supervised learning can enhance the performance of certain models. Our proposed method (for which the source code is available on GitHub) attains an F1-score of 0.8613 for sub-task A, which ranked us 10th in the competition.
- Anthology ID:
- 2023.semeval-1.196
- Volume:
- Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Atul Kr. Ojha, A. Seza Doğruöz, Giovanni Da San Martino, Harish Tayyar Madabushi, Ritesh Kumar, Elisa Sartori
- Venue:
- SemEval
- SIG:
- SIGLEX
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1421–1427
- URL:
- https://aclanthology.org/2023.semeval-1.196
- DOI:
- 10.18653/v1/2023.semeval-1.196
- Cite (ACL):
- Sana Al-Azzawi, György Kovács, Filip Nilsson, Tosin Adewumi, and Marcus Liwicki. 2023. NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pages 1421–1427, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- NLP-LTU at SemEval-2023 Task 10: The Impact of Data Augmentation and Semi-Supervised Learning Techniques on Text Classification Performance on an Imbalanced Dataset (Al-Azzawi et al., SemEval 2023)
- PDF:
- https://preview.aclanthology.org/ingest-acl-2023-videos/2023.semeval-1.196.pdf