Asha Hegde


Corpus Creation for Sentiment Analysis in Code-Mixed Tulu Text
Asha Hegde | Mudoor Devadas Anusha | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha | Bharathi Raja Chakravarthi
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

Sentiment Analysis (SA) employing code-mixed data from social media helps in getting insights to the data and decision making for various applications. One such application is to analyze users’ emotions from comments of videos on YouTube. Social media comments do not adhere to the grammatical norms of any language and they often comprise a mix of languages and scripts. The lack of annotated code-mixed data for SA in a low-resource language like Tulu makes the SA a challenging task. To address the lack of annotated code-mixed Tulu data for SA, a gold standard trlingual code-mixed Tulu annotated corpus of 7,171 YouTube comments is created. Further, Machine Learning (ML) algorithms are employed as baseline models to evaluate the developed dataset and the performance of the ML algorithms are found to be encouraging.

MUCS@DravidianLangTech@ACL2022: Ensemble of Logistic Regression Penalties to Identify Emotions in Tamil Text
Asha Hegde | Sharal Coelho | Hosahalli Shashirekha
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Emotion Analysis (EA) is the process of automatically analyzing and categorizing the input text into one of the predefined sets of emotions. In recent years, people have turned to social media to express their emotions, opinions or feelings about news, movies, products, services, and so on. These users’ emotions may help the public, governments, business organizations, film producers, and others in devising strategies, making decisions, and so on. The increasing number of social media users and the increasing amount of user generated text containing emotions on social media demands automated tools for the analysis of such data as handling this data manually is labor intensive and error prone. Further, the characteristics of social media data makes the EA challenging. Most of the EA research works have focused on English language leaving several Indian languages including Tamil unexplored for this task. To address the challenges of EA in Tamil texts, in this paper, we - team MUCS, describe the model submitted to the shared task on Emotion Analysis in Tamil at DravidianLangTech@ACL 2022. Out of the two subtasks in this shared task, our team submitted the model only for Task a. The proposed model comprises of an Ensemble of Logistic Regression (LR) classifiers with three penalties, namely: L1, L2, and Elasticnet. This Ensemble model trained with Term Frequency - Inverse Document Frequency (TF-IDF) of character bigrams and trigrams secured 4th rank in Task a with a macro averaged F1-score of 0.04. The code to reproduce the proposed models is available in github1.

Overview of the Shared Task on Machine Translation in Dravidian Languages
Anand Kumar Madasamy | Asha Hegde | Shubhanker Banerjee | Bharathi Raja Chakravarthi | Ruba Priyadharshini | Hosahalli Shashirekha | John McCrae
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents an outline of the shared task on translation of under-resourced Dravidian languages at DravidianLangTech-2022 workshop to be held jointly with ACL 2022. A description of the datasets used, approach taken for analysis of submissions and the results have been illustrated in this paper. Five sub-tasks organized as a part of the shared task include the following translation pairs: Kannada to Tamil, Kannada to Telugu, Kannada to Sanskrit, Kannada to Malayalam and Kannada to Tulu. Training, development and test datasets were provided to all participants and results were evaluated on the gold standard datasets. A total of 16 research groups participated in the shared task and a total of 12 submission runs were made for evaluation. Bilingual Evaluation Understudy (BLEU) score was used for evaluation of the translations.

MUCS@Text-LT-EDI@ACL 2022: Detecting Sign of Depression from Social Media Text using Supervised Learning Approach
Asha Hegde | Sharal Coelho | Ahmad Elyas Dashti | Hosahalli Shashirekha
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Social media has seen enormous growth in its users recently and knowingly or unknowingly the behavior of a person will be reflected in the comments she/he posts on social media. Users having the sign of depression may post negative or disturbing content seeking the attention of other users. Hence, social media data can be analysed to check whether the users’ have the sign of depression and help them to get through the situation if required. However, as analyzing the increasing amount of social media data manually in laborious and error-prone, automated tools have to be developed for the same. To address the issue of detecting the sign of depression content on social media, in this paper, we - team MUCS, describe an Ensemble of Machine Learning (ML) models and a Transfer Learning (TL) model submitted to “Detecting Signs of Depression from Social Media Text-LT-EDI@ACL 2022” (DepSign-LT-EDI@ACL-2022) shared task at Association for Computational Linguistics (ACL) 2022. Both frequency and text based features are used to train an Ensemble model and Bidirectional Encoder Representations from Transformers (BERT) fine-tuned with raw text is used to train the TL model. Among the two models, the TL model performed better with a macro averaged F-score of 0.479 and placed 18th rank in the shared task. The code to reproduce the proposed models is available in github page1.

MUCS@MixMT: IndicTrans-based Machine Translation for Hinglish Text
Asha Hegde | Shashirekha Lakshmaiah
Proceedings of the Seventh Conference on Machine Translation (WMT)

Code-mixing is the phenomena of mixing various linguistic units such as paragraphs, sentences, phrases, words, etc., of one language with that of the other language in any text. This code-mixing is predominantly used by social media users who know more than one language. Processing code-mixed text is challenging because of its characteristics and lack of tools that supports such data. Further, pretrained models can be used for the formal text and not for the informal text such as code-mixed. Developing efficient Machine Translation (MT) systems for code-mixed text is challenging due to lack of code-mixed training data. Further, existing MT systems developed to translate monolingual data are not portable to translate code-mixed text mainly due to its informal nature. To address the MT challenges of code-mixed text, this paper describes the proposed MT models submitted by our team MUCS, to the Code-mixed Machine Translation (MixMT) shared task in the Workshop on Machine Translation (WMT) organized in connection with Empirical models in Natural Language Processing (EMNLP) 2022. This shared has two subtasks: i) subtask 1 - to translate English sentences and their corresponding Hindi translations into Hinglish text and ii) subtask 2 - to translate Hinglish text into English text. The proposed models that translate the code-mixed English text to Hinglish (English-Hindli code-mixed text) and vice-versa, comprises of i) transliterating Hinglish text from Latin to Devanagari script and vice-versa, ii) pseudo translation generation using existing models, and iii) efficient target generation by combining the pseudo translations along with the training data provided by the shared task organizers. The proposed models obtained 5{textsuperscript{th} and 3{textsuperscript{rd} rank with Recall-Oriented Under-study for Gisting Evaluation (ROUGE) scores of 0.35806 and 0.55453 for subtask 1 and subtask 2 respectively.


MUCS@ - Machine Translation for Dravidian Languages using Stacked Long Short Term Memory
Asha Hegde | Ibrahim Gashaw | Shashirekha H.l.
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Dravidian language family is one of the largest language families in the world. In spite of its uniqueness, Dravidian languages have gained very less attention due to scarcity of resources to conduct language technology tasks such as translation, Parts-of-Speech tagging, Word Sense Disambiguation etc,. In this paper, we, team MUCS, describe sequence-to-sequence stacked Long Short Term Memory (LSTM) based Neural Machine Translation (NMT) model submitted to “Machine Translation in Dravidian languages”, a shared task organized by EACL-2021. The NMT model was applied on translation using English-Tamil, EnglishTelugu, English-Malayalam and Tamil-Telugu corpora provided by the organizers. Standard evaluation metrics namely Bilingual Evaluation Understudy (BLEU) and human evaluations are used to evaluate the model. Our models exhibited good accuracy for all the language pairs and obtained 2nd rank for TamilTelugu language pair.

MUM at ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Using Supervised Learning Approaches
Asha Hegde | Mudoor Devadas Anusha | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

Due to the rapid rise of social networks and micro-blogging websites, communication between people from different religion, caste, creed, cultural and psychological backgrounds has become more direct leading to the increase in cyber conflicts between people. This in turn has given rise to more and more hate speech and usage of abusive words to the point that it has become a serious problem creating negative impacts on the society. As a result, it is imperative to identify and filter such content on social media to prevent its further spread and the damage it is going to cause. Further, filtering such huge data requires automated tools since doing it manually is labor intensive and error prone. Added to this is the complex code-mixed and multi-scripted nature of social media text. To address the challenges of abusive content detection on social media, in this paper, we, team MUM, propose Machine Learning (ML) and Deep Learning (DL) models submitted to Multilingual Gender Biased and Communal Language Identification (ComMA@ICON) shared task at International Conference on Natural Language Processing (ICON) 2021. Word uni-grams, char n-grams, and emoji vectors are combined as features to train a ML Elastic-net regression model and multi-lingual Bidirectional Encoder Representations from Transformers (mBERT) is fine-tuned for a DL model. Out of the two, fine-tuned mBERT model performed better with an instance-F1 score of 0.326, 0.390, 0.343, 0.359 for Meitei, Bangla, Hindi, Multilingual texts respectively.


MUCS@Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation
Asha Hegde | H.l. Shashirekha
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

Machine Translation (MT) is the task of automatically converting the text in source language to text in target language by preserving the meaning. MT usually require large corpus for training the translation models. Due to scarcity of resources very less attention is given to translating into low resource languages and in particular into Indic languages. In this direction, a shared task called “Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation” is organized to illustrate the capability of general domain MT when translating into Indic languages and low resource domain adaptation of MT systems. In this paper, we, team MUCS, describe a simple word extraction based domain adaptation approach applied to English-Hindi MT only. MT in the proposed model is carried out using Open-NMT - a popular Neural Machine Translation tool. A general domain corpus is built effectively combining the available English-Hindi corpora and removing the duplicate sentences. Further, domain specific corpus is updated by extracting the sentences from generic corpus that contains the words given in the domain specific corpus. The proposed model exhibited satisfactory results for small domain specific AI and CHE corpora provided by the organizers in terms of BLEU score with 1.25 and 2.72 respectively. Further, this methodology is quite generic and can easily be extended to other low resource language pairs as well.