Onkar Litake


2022

To Train or Not to Train: Predicting the Performance of Massively Multilingual Models
Shantanu Patankar | Omkar Gokhale | Onkar Litake | Aditya Mandke | Dipali Kadam
Proceedings of the First Workshop on Scaling Up Multilingual Evaluation

PICT@DravidianLangTech-ACL2022: Neural Machine Translation On Dravidian Languages
Aditya Vyawahare | Rahul Tangsali | Aditya Mandke | Onkar Litake | Dipali Kadam
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper summarizes our findings from the shared task on machine translation of Dravidian languages. As part of this shared task, we carried out neural machine translation for the following five language pairs: Kannada to Tamil, Kannada to Telugu, Kannada to Malayalam, Kannada to Sanskrit, and Kannada to Tulu. The datasets for each of the five language pairs were used to train various translation models, including Seq2Seq models such as LSTM, bidirectional LSTM, and Conv Seq2Seq, as well as state-of-the-art transformers trained from scratch and fine-tuned pre-trained models. For some models involving monolingual corpora, we implemented backtranslation as well. The models were then evaluated on a part of the same dataset, using the BLEU score as the evaluation metric.
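For readers unfamiliar with the metric named in the abstract above, the following is a minimal sketch of a sentence-level BLEU score, assuming whitespace-tokenized input, a single reference, and no smoothing. The function names `ngram_counts` and `sentence_bleu` are hypothetical, and the paper does not state which BLEU implementation or settings the authors used; reported BLEU is normally computed at corpus level with a standard tool.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (orders 1..max_n) multiplied by the brevity penalty.

    Illustrative assumption: one reference, no smoothing, so any order
    with zero matches yields a score of 0.0.
    """
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        total = sum(hyp.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Calling `sentence_bleu(ref_tokens, hyp_tokens)` returns a value between 0 and 1; published scores are often this value scaled by 100.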

Optimize_Prime@DravidianLangTech-ACL2022: Emotion Analysis in Tamil
Omkar Gokhale | Shantanu Patankar | Onkar Litake | Aditya Mandke | Dipali Kadam
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper performs emotion analysis of social media comments in Tamil. Emotion analysis is the process of identifying the emotional context of a text. We present the findings obtained by Team Optimize_Prime in the ACL 2022 shared task “Emotion Analysis in Tamil.” The task aimed to classify social media comments into categories of emotion such as Joy, Anger, Trust, and Disgust. The task was further divided into two subtasks, one with 11 broad categories of emotion and the other with 31 specific categories of emotion. We implemented three different approaches to tackle this problem: transformer-based models, Recurrent Neural Networks (RNNs), and ensemble models. XLM-RoBERTa performed best on the first subtask with a macro-averaged F1 score of 0.27, while MuRIL provided the best results on the second subtask with a macro-averaged F1 score of 0.13.
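The macro-averaged F1 score reported above gives every emotion category equal weight regardless of how often it occurs. As a minimal sketch, assuming single-label classification and plain Python lists of labels, it can be computed as follows. The function name `macro_f1` and the example labels are hypothetical, and this is not the shared task's official scoring script.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in
    either the gold labels or the predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["Joy", "Joy", "Anger", "Trust"]
pred = ["Joy", "Anger", "Anger", "Joy"]
score = macro_f1(gold, pred)
```

Because rare classes count as much as frequent ones, a model that ignores infrequent emotions is penalized heavily, which is one reason macro-averaged scores on many-class tasks tend to be low.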

Optimize_Prime@DravidianLangTech-ACL2022: Abusive Comment Detection in Tamil
Shantanu Patankar | Omkar Gokhale | Onkar Litake | Aditya Mandke | Dipali Kadam
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper addresses the problem of abusive comment detection in low-resource Indic languages. Abusive comments are statements that are offensive to a person or a group of people. These comments target individuals on the basis of ethnicity, gender, caste, race, sexuality, etc. Abusive comment detection is a significant problem, especially with the recent rise in social media users. This paper presents the approach used by our team, Optimize_Prime, in the ACL 2022 shared task “Abusive Comment Detection in Tamil.” The task involves detecting and classifying YouTube comments in Tamil and in Tamil-English code-mixed format into multiple categories. We used three methods to optimize our results: ensemble models, Recurrent Neural Networks, and Transformers. On the Tamil data, MuRIL and XLM-RoBERTa were our best-performing models, with a macro-averaged F1 score of 0.43. On the code-mixed data, MuRIL and M-BERT gave our best results, with a macro-averaged F1 score of 0.45.

L3Cube-MahaNER: A Marathi Named Entity Recognition Dataset and BERT models
Onkar Litake | Maithili Ravindra Sabane | Parth Sachin Patil | Aparna Abhijeet Ranade | Raviraj Joshi
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

Named Entity Recognition (NER) is a basic NLP task with major applications in conversational and search systems. It helps identify the key entities in a sentence that are used by the downstream application. NER and similar slot-filling systems for popular languages have been heavily used in commercial applications. In this work, we focus on Marathi, an Indian language spoken prominently by the people of Maharashtra state. Marathi is a low-resource language and still lacks useful NER resources. We present L3Cube-MahaNER, the first major gold-standard named entity recognition dataset in Marathi. We also describe the manual annotation guidelines followed during the process. Finally, we benchmark the dataset on different CNN-, LSTM-, and Transformer-based models such as mBERT, XLM-RoBERTa, IndicBERT, and MahaBERT, among others. MahaBERT provides the best performance among all the models. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .