Bharathi B


2022

pdf
PANDAS@TamilNLP-ACL2022: Emotion Analysis in Tamil Text using Language Agnostic Embeddings
Divyasri K | Gayathri G L | Krithika Swaminathan | Thenmozhi Durairaj | Bharathi B | Senthil Kumar B
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

As the world around us continues to become increasingly digital, it has been acknowledged that there is a growing need for emotion analysis of social media content. The task of identifying the emotion in a given text has many practical applications ranging from screening public health to business and management. In this paper, we propose a language-agnostic model that focuses on emotion analysis in Tamil text. Our experiments yielded an F1-score of 0.010.

pdf
PANDAS@Abusive Comment Detection in Tamil Code-Mixed Data Using Custom Embeddings with LaBSE
Krithika Swaminathan | Divyasri K | Gayathri G L | Thenmozhi Durairaj | Bharathi B
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Abusive language has lately been prevalent in comments on various social media platforms. The increasing hostility observed on the internet calls for the creation of a system that can identify and flag such acerbic content, to prevent conflict and mental distress. This task becomes more challenging when low-resource languages like Tamil, as well as the often-observed Tamil-English code-mixed text, are involved. The approach used in this paper for the classification model includes different methods of feature extraction and the use of traditional classifiers. We propose a novel method of combining language-agnostic sentence embeddings with the TF-IDF vector representation that uses a curated corpus of words as vocabulary, to create a custom embedding, which is then passed to an SVM classifier. Our experimentation yielded an accuracy of 52% and an F1-score of 0.54.

pdf
SSNCSE_NLP@TamilNLP-ACL2022: Transformer based approach for Emotion analysis in Tamil language
Bharathi B | Josephine Varsha
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Emotion analysis is the process of identifying and analyzing the underlying emotions expressed in textual data. Identifying emotions from a textual conversation is a challenging task due to the absence of gestures, vocal intonation, and facial expressions. Once the chatbots and messengers detect and report the emotions of the user, a comfortable conversation can be carried out with no misunderstandings. Our task is to categorize text into a predefined notion of emotion. In this thesis, it is required to classify text into several emotional labels depending on the task. We have adopted the transformer model approach to identify the emotions present in the text sequence. Our task is to identify whether a given comment contains emotion, and the emotion it stands for. The datasets were provided to us by the LT-EDI organizers (CITATION) for two tasks, in the Tamil language. We have evaluated the datasets using the pretrained transformer models and we have obtained the micro-averaged F1 scores as 0.19 and 0.12 for Task1 and Task 2 respectively.

pdf
SSNCSE NLP@TamilNLP-ACL2022: Transformer based approach for detection of abusive comment for Tamil language
Bharathi B | Josephine Varsha
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Social media platforms along with many other public forums on the Internet have shown a significant rise in the cases of abusive behavior such as Misogynism, Misandry, Homophobia, and Cyberbullying. To tackle these concerns, technologies are being developed and applied, as it is a tedious and time-consuming task to identify, report and block these offenders. Our task was to automate the process of identifying abusive comments and classify them into appropriate categories. The datasets provided by the DravidianLangTech@ACL2022 organizers were a code-mixed form of Tamil text. We trained the datasets using pre-trained transformer models such as BERT,m-BERT, and XLNET and achieved a weighted average of F1 scores of 0.96 for Tamil-English code mixed text and 0.59 for Tamil text.

pdf
Findings of the Shared Task on Multimodal Sentiment Analysis and Troll Meme Classification in Dravidian Languages
Premjith B | Bharathi Raja Chakravarthi | Malliga Subramanian | Bharathi B | Soman Kp | Dhanalakshmi V | Sreelakshmi K | Arunaggiri Pandian | Prasanna Kumaresan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents the findings of the shared task on Multimodal Sentiment Analysis and Troll meme classification in Dravidian languages held at ACL 2022. Multimodal sentiment analysis deals with the identification of sentiment from video. In addition to video data, the task requires the analysis of corresponding text and audio features for the classification of movie reviews into five classes. We created a dataset for this task in Malayalam and Tamil. The Troll meme classification task aims to classify multimodal Troll memes into two categories. This task assumes the analysis of both text and image features for making better predictions. The performance of the participating teams was analysed using the F1-score. Only one team submitted their results in the Multimodal Sentiment Analysis task, whereas we received six submissions in the Troll meme classification task. The only team that participated in the Multimodal Sentiment Analysis shared task obtained an F1-score of 0.24. In the Troll meme classification task, the winning team achieved an F1-score of 0.596.

pdf
SUH_ASR@LT-EDI-ACL2022: Transformer based Approach for Speech Recognition for Vulnerable Individuals in Tamil
Suhasini S | Bharathi B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

An Automatic Speech Recognition System is developed for addressing the Tamil conversational speech data of the elderly people andtransgender. The speech corpus used in this system is collected from the people who adhere their communication in Tamil at some primary places like bank, hospital, vegetable markets. Our ASR system is designed with pre-trained model which is used to recognize the speechdata. WER(Word Error Rate) calculation is used to analyse the performance of the ASR system. This evaluation could help to make acomparison of utterances between the elderly people and others. Similarly, the comparison between the transgender and other people isalso done. Our proposed ASR system achieves the word error rate as 39.65%.

pdf
SSNCSE_NLP@LT-EDI-ACL2022:Hope Speech Detection for Equality, Diversity and Inclusion using sentence transformers
Bharathi B | Dhanya Srinivasan | Josephine Varsha | Thenmozhi Durairaj | Senthil Kumar B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

In recent times, applications have been developed to regulate and control the spread of negativity and toxicity on online platforms. The world is filled with serious problems like political & religious conflicts, wars, pandemics, and offensive hate speech is the last thing we desire. Our task was to classify a text into ‘Hope Speech’ and ‘Non-Hope Speech’. We searched for datasets acquired from YouTube comments that offer support, reassurance, inspiration, and insight, and the ones that don’t. The datasets were provided to us by the LTEDI organizers in English, Tamil, Spanish, Kannada, and Malayalam. To successfully identify and classify them, we employed several machine learning transformer models such as m-BERT, MLNet, BERT, XLMRoberta, and XLM_MLM. The observed results indicate that the BERT and m-BERT have obtained the best results among all the other techniques, gaining a weighted F1- score of 0.92, 0.71, 0.76, 0.87, and 0.83 for English, Tamil, Spanish, Kannada, and Malayalam respectively. This paper depicts our work for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LTEDI 2021.

pdf
SSNCSE_NLP@LT-EDI-ACL2022: Homophobia/Transphobia Detection in Multiple Languages using SVM Classifiers and BERT-based Transformers
Krithika Swaminathan | Bharathi B | Gayathri G L | Hrishik Sampath
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Over the years, there has been a slow but steady change in the attitude of society towards different kinds of sexuality. However, on social media platforms, where people have the license to be anonymous, toxic comments targeted at homosexuals, transgenders and the LGBTQ+ community are not uncommon. Detection of homophobic comments on social media can be useful in making the internet a safer place for everyone. For this task, we used a combination of word embeddings and SVM Classifiers as well as some BERT-based transformers. We achieved a weighted F1-score of 0.93 on the English dataset, 0.75 on the Tamil dataset and 0.87 on the Tamil-English Code-Mixed dataset.

pdf
SSNCSE_NLP@LT-EDI-ACL2022: Speech Recognition for Vulnerable Individuals in Tamil using pre-trained XLSR models
Dhanya Srinivasan | Bharathi B | Thenmozhi Durairaj | Senthil Kumar B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Automatic speech recognition is a tool used to transform human speech into a written form. It is used in a variety of avenues, such as in voice commands, customer, service and more. It has emerged as an essential tool in the digitisation of daily life. It has been known to be of vital importance in making the lives of elderly and disabled people much easier. In this paper we describe an automatic speech recognition model, determined by using three pre-trained models, fine-tuned from the Facebook XLSR Wav2Vec2 model, which was trained using the Common Voice Dataset. The best model for speech recognition in Tamil is determined by finding the word error rate of the data. This work explains the submission made by SSNCSE_NLP in the shared task organized by LT-EDI at ACL 2022. A word error rate of 39.4512 is achieved.

pdf
Findings of the Shared Task on Speech Recognition for Vulnerable Individuals in Tamil
Bharathi B | Bharathi Raja Chakravarthi | Subalalitha Cn | Sripriya N | Arunaggiri Pandian | Swetha Valli
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

This paper illustrates the overview of the sharedtask on automatic speech recognition in the Tamillanguage. In the shared task, spontaneousTamil speech data gathered from elderly andtransgender people was given for recognitionand evaluation. These utterances were collected from people when they communicatedin the public locations such as hospitals, markets, vegetable shop, etc. The speech corpusincludes utterances of male, female, and transgender and was split into training and testingdata. The given task was evaluated using WER(Word Error Rate). The participants used thetransformer-based model for automatic speechrecognition. Different results using differentpre-trained transformer models are discussedin this overview paper.

2021

pdf
SSNCSE_NLP@DravidianLangTech-EACL2021: Offensive Language Identification on Multilingual Code Mixing Text
Bharathi B | Agnusimmaculate Silvia A
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Social networks made a huge impact in almost all fields in recent years. Text messaging through the Internet or cellular phones has become a major medium of personal and commercial communication. Everyday we have to deal with texts, emails or different types of messages in which there are a variety of attacks and abusive phrases. It is the moderator’s decision which comments to remove from the platform because of violations and which ones to keep but an automatic software for detecting abusive languages would be useful in recent days. In this paper we describe an automatic offensive language identification from Dravidian languages with various machine learning algorithms. This is work is shared task in DravidanLangTech-EACL2021. The goal of this task is to identify offensive language content of the code-mixed dataset of comments/posts in Dravidian Languages ( (Tamil-English, Malayalam-English, and Kannada-English)) collected from social media. This work explains the submissions made by SSNCSE_NLP in DravidanLangTech-EACL2021 Code-mix tasks for Offensive language detection. We achieve F1 scores of 0.95 for Malayalam, 0.7 for Kannada and 0.73 for task2-Tamil on the test-set.

pdf
SSNCSE_NLP@DravidianLangTech-EACL2021: Meme classification for Tamil using machine learning approach
Bharathi B | Agnusimmaculate Silvia A
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious contents targeting users or communities. One way of trolling is by making memes. A meme is an image or video that represents the thoughts and feelings of a specific audience. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. A meme is a form of media that spreads an idea or emotion across the internet. The multi modal nature of memes, postings of hateful memes or related events like trolling, cyberbullying are increasing day by day. Memes make it even more challenging since they express humour and sarcasm in an implicit way, because of which the meme may not be offensive if we only consider the text or the image. In this paper we proposed a approach for meme classification for Tamil language that considers only the text present in the meme. This work explains the submissions made by SSNCSE NLP in DravidanLangTechEACL2021 task for meme classification in Tamil language. We achieve F1 scores of 0.50 using the proposed approach using the test-set.

2020

pdf
Semi-supervised Fine-grained Approach for Arabic dialect detection task
Nitin Nikamanth Appiah Balaji | Bharathi B
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Arabic being a language with numerous different dialects, it becomes extremely important to device a technique to distinguish each dialect efficiently. This paper focuses on the fine-grained country level and province level classification of Arabic dialects. The experiments in this paper are submissions done to the NADI 2020 shared Dialect detection task. Various text feature extraction techniques such as TF-IDF, AraVec, multilingual BERT and Fasttext embedding models are studied. We thereby, propose an approach of text embedding based model with macro average F1 score of 0.2232 for task1 and 0.0483 for task2, with the help of semi supervised learning approach.