Siva Sai


2021

pdf
Towards Offensive Language Identification for Dravidian Languages
Siva Sai | Yashvardhan Sharma
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Offensive speech identification in countries like India poses several challenges due to the usage of code-mixed and romanized variants of multiple languages by the users in their posts on social media. The challenge of offensive language identification on social media for Dravidian languages is harder, considering the low resources available for the same. In this paper, we explored the zero-shot learning and few-shot learning paradigms based on multilingual language models for offensive speech detection in code-mixed and romanized variants of three Dravidian languages - Malayalam, Tamil, and Kannada. We propose a novel and flexible approach of selective translation and transliteration to reap better results from fine-tuning and ensembling multilingual transformer networks like XLMRoBERTa and mBERT. We implemented pretrained, fine-tuned, and ensembled versions of XLM-RoBERTa for offensive speech classification. Further, we experimented with interlanguage, inter-task, and multi-task transfer learning techniques to leverage the rich resources available for offensive speech identification in the English language and to enrich the models with knowledge transfer from related tasks. The proposed models yielded good results and are promising for effective offensive speech identification in low resource settings.

2020

pdf
Siva at WNUT-2020 Task 2: Fine-tuning Transformer Neural Networks for Identification of Informative Covid-19 Tweets
Siva Sai
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Social media witnessed vast amounts of misinformation being circulated every day during the Covid-19 pandemic so much so that the WHO Director-General termed the phenomenon as “infodemic.” The ill-effects of such misinformation are multifarious. Thus, identifying and eliminating the sources of misinformation becomes very crucial, especially when mass panic can be controlled only through the right information. However, manual identification is arduous, with such large amounts of data being generated every day. This shows the importance of automatic identification of misinformative posts on social media. WNUT-2020 Task 2 aims at building systems for automatic identification of informative tweets. In this paper, I discuss my approach to WNUT-2020 Task 2. I fine-tuned eleven variants of four transformer networks -BERT, RoBERTa, XLM-RoBERTa, ELECTRA, on top of two different preprocessing techniques to reap good results. My top submission achieved an F1-score of 85.3% in the final evaluation.