CS/NLP at SemEval-2022 Task 4: Effective Data Augmentation Methods for Patronizing Language Detection and Multi-label Classification with RoBERTa and GPT3

Daniel Saeedi, Sirwe Saeedi, Aliakbar Panahi, Alvis C.M. Fong


Abstract
This paper presents a combination of data augmentation methods to boost the performance of state-of-the-art transformer-based language models on Patronizing and Condescending Language (PCL) detection and multi-label PCL classification. These tasks differ inherently from sentiment analysis, because a positive or negative attitude hidden in the context is not necessarily positive or negative for PCL tasks. The ablation study shows that the degree of class imbalance in the PCL dataset is in the extreme range. To work around the maximum sequence length limitation, this paper presents a modified version of the PEGASUS sentence-paraphrasing deep learning model; the proposed algorithm imposes no specific maximum input length on the sequences it paraphrases. Augmenting the underrepresented class of annotated data achieved competitive results among the top 16 SemEval-2022 participants. The approaches in this paper rely on fine-tuning pretrained RoBERTa and GPT3 models, such as the Davinci and Curie engines, on an extra-enriched PCL dataset. Furthermore, we discuss few-shot learning as a technique for overcoming the limitations of low-resource NLP problems.
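The abstract does not say how the modified PEGASUS model removes the input length limit. One common way to get that effect, shown here purely as an assumption and not as the authors' algorithm, is to split a long text into sentence-bounded chunks, paraphrase each chunk separately, and join the results. The names `split_into_chunks`, `paraphrase_long_text`, and `max_words` are hypothetical, and the `paraphrase` argument stands in for any length-limited paraphrasing model.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 40) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk,
    so this sketch never cuts a sentence in the middle.
    """
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Flush the current chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def paraphrase_long_text(
    text: str, paraphrase: Callable[[str], str], max_words: int = 40
) -> str:
    """Paraphrase text of any length by paraphrasing each chunk in turn."""
    return " ".join(paraphrase(chunk) for chunk in split_into_chunks(text, max_words))
```

In practice `paraphrase` would wrap a call to a paraphrasing model, and `max_words` would be chosen to stay under that model's token limit. A word count is only a rough proxy for a token count, so a real pipeline would likely measure length with the model's own tokenizer.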
Anthology ID:
2022.semeval-1.69
Volume:
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
Month:
July
Year:
2022
Address:
Seattle, United States
Venue:
SemEval
SIGs:
SIGLEX | SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
503–508
URL:
https://aclanthology.org/2022.semeval-1.69
DOI:
10.18653/v1/2022.semeval-1.69
Cite (ACL):
Daniel Saeedi, Sirwe Saeedi, Aliakbar Panahi, and Alvis C.M. Fong. 2022. CS/NLP at SemEval-2022 Task 4: Effective Data Augmentation Methods for Patronizing Language Detection and Multi-label Classification with RoBERTa and GPT3. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 503–508, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
CS/NLP at SemEval-2022 Task 4: Effective Data Augmentation Methods for Patronizing Language Detection and Multi-label Classification with RoBERTa and GPT3 (Saeedi et al., SemEval 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.semeval-1.69.pdf