Sai Krishna Rallabandi


2021

pdf bib
Task-Specific Pre-Training and Cross Lingual Transfer for Sentiment Analysis in Dravidian Code-Switched Languages
Akshat Gupta | Sai Krishna Rallabandi | Alan W Black
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment analysis in Code-Mixed languages has garnered a lot of attention in recent years. It is an important task for social media monitoring and has many applications, as a large chunk of social media data is Code-Mixed. In this paper, we work on the problem of sentiment analysis for Dravidian Code-Switched languages - Tamil-Engish and Malayalam-English, using three different BERT based models. We leverage task-specific pre-training and cross-lingual transfer to improve on previously reported results, with significant improvement for the Tamil-Engish dataset. We also present a multilingual sentiment classification model that has competitive performance on both Tamil-English and Malayalam-English datasets.

pdf bib
Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching
Parul Chopra | Sai Krishna Rallabandi | Alan W Black | Khyathi Raghavi Chandu
Findings of the Association for Computational Linguistics: EMNLP 2021

Code-switching (CS), a ubiquitous phenomenon due to the ease of communication it offers in multilingual communities still remains an understudied problem in language processing. The primary reasons behind this are: (1) minimal efforts in leveraging large pretrained multilingual models, and (2) the lack of annotated data. The distinguishing case of low performance of multilingual models in CS is the intra-sentence mixing of languages leading to switch points. We first benchmark two sequence labeling tasks – POS and NER on 4 different language pairs with a suite of pretrained models to identify the problems and select the best performing char-BERT model among them (addressing (1)). We then propose a self training method to repurpose the existing pretrained models using a switch-point bias by leveraging unannotated data (addressing (2)). We finally demonstrate that our approach performs well on both tasks by reducing the gap between the switch point performance while retaining the overall performance on two distinct language pairs in both the tasks. We plan to release our models and the code for all our experiments.

pdf bib
Unsupervised Self-Training for Sentiment Analysis of Code-Switched Data
Akshat Gupta | Sargam Menghani | Sai Krishna Rallabandi | Alan W Black
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Sentiment analysis is an important task in understanding social media content like customer reviews, Twitter and Facebook feeds etc. In multilingual communities around the world, a large amount of social media text is characterized by the presence of Code-Switching. Thus, it has become important to build models that can handle code-switched data. However, annotated code-switched data is scarce and there is a need for unsupervised models and algorithms. We propose a general framework called Unsupervised Self-Training and show its applications for the specific use case of sentiment analysis of code-switched data. We use the power of pre-trained BERT models for initialization and fine-tune them in an unsupervised manner, only using pseudo labels produced by zero-shot transfer. We test our algorithm on multiple code-switched languages and provide a detailed analysis of the learning dynamics of the algorithm with the aim of answering the question - ‘Does our unsupervised model understand the Code-Switched languages or does it just learn its representations?’. Our unsupervised models compete well with their supervised counterparts, with their performance reaching within 1-7% (weighted F1 scores) when compared to supervised models trained for a two class problem.

2020

pdf bib
A Resource for Computational Experiments on Mapudungun
Mingjun Duan | Carlos Fasola | Sai Krishna Rallabandi | Rodolfo Vega | Antonios Anastasopoulos | Lori Levin | Alan W Black
Proceedings of the 12th Language Resources and Evaluation Conference

We present a resource for computational experiments on Mapudungun, a polysynthetic indigenous language spoken in Chile with upwards of 200 thousand speakers. We provide 142 hours of culturally significant conversations in the domain of medical treatment. The conversations are fully transcribed and translated into Spanish. The transcriptions also include annotations for code-switching and non-standard pronunciations. We also provide baseline results on three core NLP tasks: speech recognition, speech synthesis, and machine translation between Spanish and Mapudungun. We further explore other applications for which the corpus will be suitable, including the study of code-switching, historical orthography change, linguistic structure, and sociological and anthropological studies.