Saranya Rajiakodi

2026

GYAAN-SAHIT: A Persona-Driven Multi-Agent Framework for Caste-Based Hate Speech Detection
Sakshi Gupta | Shunmuga Priya Muthusamy Chinnan | Saranya Rajiakodi | Ratnavel Rajalakshmi | Bharathi Raja Chakravarthi
Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion

Social media has amplified public discourse in India while perpetuating caste-based hierarchies. Despite legal protections, caste-based hate speech continues to propagate across digital platforms through culturally embedded expressions that conventional classifiers often struggle to interpret. We propose GYAAN-SAHIT, a knowledge-driven multi-agent framework that addresses this problem through structured debate-based classification. Each agent adopts a distinct ideological and socio-cultural persona, engaging in multi-turn argumentation to reason over context, subtext, and intent. A critic agent then evaluates the coherence of the debate before producing the final classification. The framework further integrates Hindi hate lexicons to ground its reasoning in linguistic and cultural specificity. Experiments show that GYAAN-SAHIT achieves improvement in performance while generating culturally grounded explanations, demonstrating the effectiveness of persona-based multi-agent reasoning for hate speech detection in low-resource and socially complex environments.

Findings of Shared Task on Counter Narrative Generation on Homophobic and Transphobic Comments
Prasanna Kumar Kumaresan | Praveen Prasannan | Tanay Singh | Ruba Priyadharshini | Subalalitha Chinnaudayar Navaneethakrishnan | Saranya Rajiakodi | Paul Buitelaar | Bharathi Raja Chakravarthi
Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion

Online platforms continue to witness harmful expressions targeting LGBTQ+ individuals, particularly in the form of homophobic and transphobic comments. While detection of such content has received substantial attention, generating constructive counter-narratives remains comparatively underexplored. In this shared task, we focus on counter-narrative generation in English and Tamil. Participants were provided with social media comments labeled as homophobic or transphobic and were required to generate respectful, contextually appropriate responses that challenge prejudice and promote empathy. Systems were evaluated using both reference-based metrics (Distinct-2 and BERTScore-F1) and rubric-based human evaluation metrics measuring politeness (PRS), quality (QS), and contextual coherence (CCNC). The results demonstrate variation in system performance across languages, with English systems showing stronger lexical diversity and Tamil systems excelling in politeness and contextual coherence. This paper presents dataset statistics, evaluation methodology, system performance analysis, and key observations from the shared task.
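For readers unfamiliar with Distinct-2, the sketch below shows one common way to compute it: the ratio of unique bigrams to total bigrams over a set of generated responses. The function name `distinct_n`, the lowercasing, and the whitespace tokenization are assumptions made for illustration; the shared task's own tokenization and aggregation are not specified in the abstract, and whitespace splitting is only a rough proxy for Tamil.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the tokens; texts shorter than n yield nothing.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical responses: 4 bigrams in total, 3 of them unique.
responses = ["love is love", "love is kind"]
score = distinct_n(responses)
```

A higher value indicates less repetition across the generated responses, which is why the metric is read as a measure of lexical diversity.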

TamilPoliSent 2026: A Shared Task report on Multiclass Political Sentiment Analysis in Tamil
Mani Vegupatti | Kishore Kumar Ponnusamy | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Prasanna Kumar Kumaresan | Sathiyaraj Thangasamy
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Political sentiment analysis aims to automatically identify opinions and attitudes expressed in political discourse on social media platforms. This paper presents an overview of the TamilPoliSent 2026 shared task on multiclass political sentiment analysis in Tamil, organized as part of DravidianLangTech@ACL 2026. The task focuses on categorizing Tamil comments from X (formerly Twitter) into seven sentiment classes: Substantiated, Sarcastic, Opinionated, Positive, Negative, Neutral, and None of the above. The dataset consists of 5,440 annotated Tamil tweets collected from political discussions on social media. Participants were provided with labeled training and development datasets, while the test set was used for final evaluation. A total of 22 teams participated in the shared task and explored a wide range of modeling approaches including classical machine learning methods, transformer-based architectures, hybrid lexical–contextual models, and ensemble frameworks. System performance was evaluated using Macro F1-score to ensure balanced evaluation across all sentiment categories. The best-performing system achieved a Macro F1-score of 0.3935. The results highlight several challenges in Tamil political sentiment analysis, including class imbalance, sarcasm, informal writing styles, and semantic overlap between sentiment categories. The shared task demonstrates that transformer-based models combined with class-balanced learning and hybrid representations are effective for handling fine-grained political sentiment classification in low-resource languages. These findings contribute to advancing research in political discourse analysis and natural language processing for Tamil and other under-resourced languages.
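Macro F1 is the evaluation metric for most of the shared tasks listed on this page. The following is a minimal sketch of how it is conventionally computed: an unweighted mean of per-class F1 scores, so that rare classes count as much as frequent ones. The function name `macro_f1` and the toy labels are illustrative assumptions and not the organizers' evaluation script.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # missed an instance of class g
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: always predicting the majority class.
gold = ["Positive"] * 8 + ["Sarcastic"] * 2
pred = ["Positive"] * 10
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
```

In this toy case accuracy is 0.8 while macro F1 is about 0.44, because the minority class contributes an F1 of zero. This is the sense in which the metric gives balanced evaluation across classes.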

This paper presents an overview of the Shared Task on Prompt Recovery for Large Language Models (LLMs) in Telugu, organized as part of DravidianLangTech @ ACL 2026. The task focuses on identifying the underlying communicative style of Telugu text excerpts, framed as a nine-class single-label classification problem covering Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative, and Persuasive tones. The dataset was constructed by collecting Telugu YouTube comments and generating style-modified variants using an LLM, resulting in 3,000 training instances, 300 validation samples, and 301 test samples. A total of 52 teams registered for the shared task, with 13 teams submitting valid system predictions. Systems explored diverse approaches, including transformer-based fine-tuning (IndicBERT, MuRIL, XLM-R), ensemble and stacking methods, pairwise modeling strategies, curriculum learning, and few-shot large language model prompting. Evaluation was conducted using Macro F1-score as the primary metric. The top-performing system achieved a Macro F1-score of 0.2987. Overall results indicate that Telugu prompt-style recovery remains a challenging problem, particularly due to stylistic overlap and high lexical similarity across classes.

Shared Task on Depression Detection from Malayalam and Tamil Speech Data
Jyothish Lal G | Premjith B | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Prasanna Kumar Kumaresan
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Depression is one of the most common mental health problems in the world. It affects a person’s emotions, thinking, energy levels, and daily life. Early detection of depression is very important to provide timely support and treatment. While many studies focus on identifying depression from text, speech also carries important emotional and psychological signals that are often not fully explored. This paper presents an overview of the shared task on Depression Detection in Dravidian Languages (DD-DL). The task focuses on identifying signs of depression from speech data in two low-resource Dravidian languages: Tamil and Malayalam. Participants were provided with curated training datasets and were asked to build systems to classify speech samples as Depressed or Non-Depressed. The shared task includes two subtasks: (1) Depression detection in Tamil and (2) Depression detection in Malayalam. Participants applied various machine learning and deep learning approaches to model the acoustic and linguistic characteristics of speech. All submissions were evaluated using the macro-F1 score, which ensures fair performance measurement across classes.

Overview of the Shared Task on Multilevel Political Meme Classification in Tamil and Malayalam
Saranya Rajiakodi | Shunmuga Priya Muthusamy Chinnan | Premjith B | Subalalitha CN | Rahul Ponnusamy | Anshid K A | Bhuvaneswari Sivagnanam | Bharathi Raja Chakravarthi
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This paper presents an overview of the Multi-Level Political Meme Classification shared task conducted at DravidianLangTech–ACL 2026. The task introduces a hierarchical two-level classification framework for Tamil and Malayalam political memes: Level 1 focuses on stance detection (Support/Praise vs. Troll/Oppose), while Level 2 identifies the political target (individual or party), conditioned on the predicted stance. The dataset was curated from social media platforms and manually annotated with strong inter-annotator agreement. A total of 64 teams registered and 19 teams submitted their results using diverse multimodal approaches combining transformer-based text encoders, vision models, OCR pipelines, and hierarchical architectures. Results show that stance detection achieves high macro-F1 scores across both languages, whereas target identification remains more challenging, particularly in Malayalam. The findings highlight the importance of multimodal fusion, hierarchical reasoning, and robustness to OCR noise and class imbalance in political meme analysis.

From Comments to Harm: A Findings Report on Abusive Tamil Text Targeting Women on Social Media Shared Task
Bhuvaneswari Sivagnanam | Kathiravan Pannerselvam | Jananayagan | Charmathi Rajkumar | Ramesh Kannan R | Ratnavel Rajalakshmi | Shunmuga Priya Muthusamy Chinnan | Saranya Rajiakodi | Bharathi Raja Chakravarthi
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This paper presents an overview of the second shared task on Abusive Tamil Text Targeting Women on Social Media, framed as a binary classification problem (abusive vs. non-abusive). We release a dataset of Tamil YouTube comments and evaluate submissions using macro-F1 to encourage balanced performance in a noisy, low-resource setting. A total of 89 teams registered for this task, and 24 teams submitted results. The approaches used by the teams include transformer fine-tuning, heterogeneous ensembles, classical baselines, and large language models using prompting and LoRA. Results show that the best-performing system scored 0.8297 macro-F1, with many submissions falling between 0.79 and 0.81. Across submissions, transformer fine-tuning with domain-aligned encoders is consistently strong, while additional gains are frequently associated with Tamil-aware normalization and macro-F1-oriented calibration such as class-weighted learning and validation-based threshold tuning. Overall, the findings highlight the importance of language-aware preprocessing and careful decision calibration for reliable moderation of women-targeted abusive Tamil social media text. Disclaimer: This paper (including figures and examples) may contain offensive or harmful language, including abusive content targeting women. All such text is presented solely for research and educational purposes and does not reflect the authors’ views. Reader discretion is advised.

2025

Overview of the Shared Task on Detecting Racial Hoaxes in Code-Mixed Hindi-English Social Media Data
Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Shanu Dhawale | Saranya Rajiakodi | Sajeetha Thavareesan | Subalalitha Chinnaudayar Navaneethakrishnan | Thenmozhi Durairaj
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

The widespread use of social media has made it easier for false information to proliferate, particularly racially motivated hoaxes that can encourage violence and hatred. Such content is frequently shared in code-mixed languages in multilingual nations like India, which presents special difficulties for automated detection systems because of the casual language, erratic grammar, and rich cultural background. The shared task on detecting racial hoaxes in code-mixed social media data aims to identify racial hoaxes in Hindi-English data. It is a binary classification task with more than 5,000 labeled instances. A total of 11 teams participated in the task, and the results were evaluated using the macro-F1 score. The team that employed XLM-RoBERTa secured the first position in the task.

Findings of the Shared Task Caste and Migration Hate Speech Detection
Saranya Rajiakodi | Bharathi Raja Chakravarthi | Rahul Ponnusamy | Shunmuga Priya Muthusamy Chinnan | Prasanna Kumar Kumaresan | Sathiyaraj Thangasamy | Bhuvaneswari Sivagnanam | Balasubramanian Palani | Kogilavani Shanmugavadivel | Abirami Murugappan | Charmathi Rajkumar
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

Hate speech targeting caste and migration communities is a growing concern on online platforms, particularly in linguistically diverse regions. By focusing on Tamil text, this task provides a unique opportunity to tackle caste- or migration-related hate speech detection in a low-resource language, contributing to a safer digital space. We present the results and main findings of the shared task on caste and migration hate speech detection. The task is a binary classification problem: determining whether a text is caste/migration-related hate speech or not. The task attracted 17 participating teams, experimenting with a wide range of methodologies from traditional machine learning to advanced multilingual transformers. The top-performing system achieved a macro F1-score of 0.88105 using an ensemble of fine-tuned transformer models including XLM-R and MuRIL. Our analysis highlights the effectiveness of multilingual transformers in low-resource settings, ensemble learning, and culturally informed techniques grounded in socio-political context.

Findings of the Shared Task Multilingual Bias and Propaganda Annotation in Political Discourse
Shunmuga Priya Muthusamy Chinnan | Bharathi Raja Chakravarthi | Meghann Drury-Grogan | Senthil Kumar B | Saranya Rajiakodi | Angel Deborah S
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

The Multilingual Bias and Propaganda Annotation task focuses on annotating biased and propagandist content in political discourse across English and Tamil. This paper presents the findings of the shared task on bias and propaganda annotation. The task involves two subtasks, one in English and another in Tamil, both of which are annotation tasks in which a text comment is to be labeled. With a particular emphasis on polarizing policy debates such as the US Gender Policy and India’s Three Language Policy, this shared task invites participants to build annotation systems capable of labeling textual bias and propaganda. The dataset was curated by collecting comments from YouTube videos. Our curated dataset consists of 13,010 English sentences on the US Gender Policy and the Russia-Ukraine War, and 5,880 Tamil sentences on the Three Language Policy. Participants were instructed to annotate at the sentence level, following the guidelines, with fine-grained, domain-specific bias labels and 4 propaganda labels. Participants were encouraged to leverage existing tools or develop novel approaches to perform fine-grained annotations that capture the complex socio-political nuances present in the data.

The increasing prevalence of misogynistic content in online memes has raised concerns about its impact on digital discourse. The culture-specific images and informal use of text in memes present considerable challenges for automatic detection systems, especially in low-resource languages. While previous shared tasks have addressed misogyny detection in English and several European languages, misogynistic meme detection in Chinese has remained largely unexplored. To address this gap, we introduced a shared task focused on binary classification of Chinese-language memes as misogynistic or non-misogynistic. The task featured memes collected from Chinese social media and annotated by native speakers. A total of 45 teams registered, with 8 teams submitting predictions from their multimodal models integrating textual and visual features through diverse fusion strategies. The best-performing system achieved a macro F1-score of 0.93035, highlighting the effectiveness of lightweight pretrained encoder fusion. This system used Chinese BERT and DenseNet-121 for text and image feature extraction, respectively. A feedforward network was trained as a classifier on the concatenated text and image features.

Overview of the Shared Task on Multimodal Hate Speech Detection in Dravidian languages: DravidianLangTech@NAACL 2025
Jyothish Lal G | Premjith B | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Bharathi B | Rajeswari Natarajan | Ratnavel Rajalakshmi
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The detection of hate speech in social media platforms is very crucial these days. This is due to its adverse impact on mental health, social harmony, and online safety. This paper presents the overview of the shared task on Multimodal Hate Speech Detection in Dravidian Languages organized as part of DravidianLangTech@NAACL 2025. The task emphasizes detecting hate speech in social media content that combines speech and text. Here, we focus on three low-resource Dravidian languages: Malayalam, Tamil, and Telugu. Participants were required to classify hate speech in three sub-tasks, each corresponding to one of these languages. The dataset was curated by collecting speech and corresponding text from YouTube videos. Various machine learning and deep learning-based models, including transformer-based architectures and multimodal frameworks, were employed by the participants. The submissions were evaluated using the macro F1 score. Experimental results underline the potential of multimodal approaches in advancing hate speech detection for low-resource languages. Team SSNTrio achieved the highest F1 score in Malayalam and Tamil of 0.7511 and 0.7332, respectively. Team lowes scored the best F1 score of 0.3817 in the Telugu sub-task.

Overview on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Sathiyaraj Thangasamy | Ratnasingam Sakuntharaj | Prasanna Kumar Kumaresan | Kishore Kumar Ponnusamy | Arunaggiri Pandian Karunanidhi | Rohan R
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Political multiclass detection is the task of identifying the predefined seven political classes. In this paper, we report an overview of the findings on the “Political Multiclass Sentiment Analysis of Tamil X(Twitter) Comments” shared task conducted at the workshop on DravidianLangTech@NAACL 2025. The participants were provided with annotated Twitter comments, which are split into training, development, and unlabelled test datasets. A total of 139 participants registered for this shared task, and 25 teams finally submitted their results. The performance of the submitted systems was evaluated and ranked in terms of the macro-F1 score.

Findings of the Shared Task on Misogyny Meme Detection: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi | Rahul Ponnusamy | Saranya Rajiakodi | Shunmuga Priya Muthusamy Chinnan | Paul Buitelaar | Bhuvaneswari Sivagnanam | Anshid Kizhakkeparambil
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rapid expansion of social media has facilitated communication but also enabled the spread of misogynistic memes, reinforcing gender stereotypes and toxic online environments. Detecting such content is challenging due to the multimodal nature of memes, where meaning emerges from the interplay of text and images. The Misogyny Meme Detection shared task at DravidianLangTech@NAACL 2025 focused on Tamil and Malayalam, encouraging the development of multimodal approaches. With 114 teams registered and 23 submitting predictions, participants leveraged various pretrained language models and vision models through fusion techniques. The best models achieved high macro F1 scores (0.83682 for Tamil, 0.87631 for Malayalam), highlighting the effectiveness of multimodal learning. Despite these advances, challenges such as bias in the data set, class imbalance, and cultural variations persist. Future research should refine multimodal detection methods to improve accuracy and adaptability, fostering safer and more inclusive online spaces.

This overview paper presents the findings of the Shared Task on Abusive Tamil and Malayalam Text Targeting Women on Social Media, organized as part of DravidianLangTech@NAACL 2025. The task aimed to encourage the development of robust systems to detect abusive content targeting women in Tamil and Malayalam, two low-resource Dravidian languages. Participants were provided with annotated datasets containing abusive and non-abusive text curated from YouTube comments. We present an overview of the approaches and analyse the results of the shared task submissions. We believe the findings presented in this paper will be useful to researchers working in Dravidian language technology.

CUTN_Bio at BioLaySumm: Multi-Task Prompt Tuning with External Knowledge and Readability adaptation for Layman Summarization
Bhuvaneswari Sivagnanam | Rivo Krishnu C H | Princi Chauhan | Saranya Rajiakodi
Proceedings of the 24th Workshop on Biomedical Language Processing (Shared Tasks)

2024

Overview of Shared Task on Caste and Migration Hate Speech Detection
Saranya Rajiakodi | Bharathi Raja Chakravarthi | Rahul Ponnusamy | Prasanna Kumar Kumaresan | Sathiyaraj Thangasamy | Bhuvaneswari Sivagnanam | Charmathi Rajkumar
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

We present an overview of the first shared task on “Caste and Migration Hate Speech Detection.” The shared task is organized as part of LTEDI@EACL 2024. The system must make a binary decision, determining whether or not a text is caste/migration hate speech. The dataset presented in this shared task is in Tamil, which is one of the under-resourced languages. A total of 51 teams participated in this task. Among them, 15 teams submitted their results. To the best of our knowledge, this is the first time a shared task has been conducted on textual hate speech detection concerning caste and migration. In this study, we present a systematic analysis of all the participants’ contributions as well as the statistics of the dataset, which consists of social media comments in Tamil. We also provide a comprehensive analysis of the participants’ methodologies and their findings.

This paper offers a detailed overview of the first shared task on “Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes,” organized as part of the LT-EDI@EACL 2024 conference. The task was set to classify misogynistic content and troll memes within online platforms, focusing specifically on memes in Tamil and Malayalam languages. A total of 52 teams registered for the competition, with four submitting systems for the Tamil meme classification task and three for the Malayalam task. The outcomes of this shared task are significant, providing insights into the current state of misogynistic content in digital memes and highlighting the effectiveness of various computational approaches in identifying such detrimental content. The top-performing model got a macro F1 score of 0.73 in Tamil and 0.87 in Malayalam.

This paper provides a comprehensive summary of the “Homophobia and Transphobia Detection in Social Media Comments” shared task, which was held at LT-EDI@EACL 2024. The objective of this task was to develop systems capable of identifying instances of homophobia and transphobia within social media comments. This challenge was extended across ten languages: English, Tamil, Malayalam, Telugu, Kannada, Gujarati, Hindi, Marathi, Spanish, and Tulu. Each comment in the dataset was annotated into one of three categories. The shared task attracted significant interest, with over 60 teams participating through the CodaLab platform. The participants’ predictions were evaluated using the macro F1 score.

From Laughter to Inequality: Annotated Dataset for Misogyny Detection in Tamil and Malayalam Memes
Rahul Ponnusamy | Kathiravan Pannerselvam | Saranya Rajiakodi | Prasanna Kumar Kumaresan | Sajeetha Thavareesan | Bhuvaneswari Sivagnanam | Anshid K.A | Susminu S Kumar | Paul Buitelaar | Bharathi Raja Chakravarthi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this digital era, memes have become a prevalent form of online expression, humor, sarcasm, and social commentary. However, beneath their surface lie concerning issues such as the propagation of misogyny, gender-based bias, and harmful stereotypes. To address these issues, we introduce MDMD (Misogyny Detection Meme Dataset) in this paper. This article focuses on creating an annotated dataset with detailed annotation guidelines to delve into online misogyny within the Tamil and Malayalam-speaking communities. Through analyzing memes, we uncover the intricate world of gender bias and stereotypes in these communities, shedding light on their manifestations and impact. This dataset, along with its comprehensive annotation guidelines, is a valuable resource for understanding the prevalence, origins, and manifestations of misogyny in various contexts, aiding researchers, policymakers, and organizations in developing effective strategies to combat gender-based discrimination and promote equality and inclusivity. It enables a deeper understanding of the issue and provides insights that can inform strategies for cultivating a more equitable and secure online environment. This work represents a crucial step in raising awareness and addressing gender-based discrimination in the digital space.

This paper presents the findings of the shared task on multimodal sentiment analysis, abusive language detection and hate speech detection in Dravidian languages. Through this shared task, researchers worldwide can submit models for three crucial social media data analysis challenges in Dravidian languages: sentiment analysis, abusive language detection, and hate speech detection. The aim is to build models for fine-grained sentiment analysis of multimodal data in Tamil and Malayalam, and for identifying abusive and hate content in multimodal data in Tamil. Three modalities make up the multimodal data: text, audio, and video. YouTube videos were gathered to create the datasets for the tasks. Thirty-nine teams took part in the competition. However, only two teams submitted their results. The macro F1-score was used to assess the submissions.

Findings of the Shared Task on Hate and Offensive Language Detection in Telugu Codemixed Text (HOLD-Telugu)@DravidianLangTech 2024
Premjith B | Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Saranya Rajiakodi | Sai Prashanth Karnati | Sai Rishith Reddy Mangamuru | Chandu Janakiram
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This paper examines the submissions of various participating teams to the task on Hate and Offensive Language Detection in Telugu Codemixed Text (HOLD-Telugu) organized as part of DravidianLangTech 2024. The shared task encourages researchers and academicians to build models that identify harmful content in Telugu codemixed social media text. The dataset for the task was created by gathering YouTube comments and annotating them manually. A total of 23 teams participated and submitted their results to the shared task. The rank list was created by assessing the submitted results using the macro F1-score.

SetFit: A Robust Approach for Offensive Content Detection in Tamil-English Code-Mixed Conversations Using Sentence Transfer Fine-tuning
Kathiravan Pannerselvam | Saranya Rajiakodi | Sajeetha Thavareesan | Sathiyaraj Thangasamy | Kishore Ponnusamy
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Code-mixed languages are increasingly prevalent on social media and online platforms, presenting significant challenges in offensive content detection for natural language processing (NLP) systems. Our study explores how effectively the Sentence Transfer Fine-tuning (Set-Fit) method, combined with logistic regression, detects offensive content in a Tamil-English code-mixed dataset. We compare our model’s performance with five other NLP models: Multilingual BERT (mBERT), LSTM, BERT, IndicBERT, and Language-agnostic BERT Sentence Embeddings (LaBSE). Our model, SetFit, outperforms these models in accuracy, achieving an impressive 89.72%, significantly higher than other models. These results suggest the sentence transformer model’s substantial potential for detecting offensive content in codemixed languages. Our study provides valuable insights into the sentence transformer model’s ability to identify various types of offensive material in Tamil-English online conversations, paving the way for more advanced NLP systems tailored to code-mixed languages.

2023

CSSCUTN@DravidianLangTech:Abusive comments Detection in Tamil and Telugu
Kathiravan Pannerselvam | Saranya Rajiakodi | Rahul Ponnusamy | Sajeetha Thavareesan
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Code-mixing is the word- or phrase-level interchange of two or more languages within a sentence, in conversation or in written text. This phenomenon is widespread on social media platforms, and understanding the underlying abusive comments in a code-mixed sentence is a complex challenge. We present our system submitted to the DravidianLangTech Shared Task on Abusive Comment Detection in Tamil and Telugu. Our approach involves building a multiclass abusive detection model that recognizes 8 different labels. The provided samples are code-mixed Tamil-English text, where Tamil is represented in romanised form. We focused on the multiclass classification subtask and leveraged Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our method earned the ninth rank among all competing systems for the classification of abusive comments in code-mixed text. Our proposed classifier achieves an accuracy of 0.99 and an F1-score of 0.99 on a balanced dataset using TF-IDF with SVM. It can be used effectively to detect abusive comments in Tamil-English code-mixed text.