This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
RahulPonnusamy
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
This overview paper presents the findings of the Shared Task on Abusive Tamil and Malayalam Text Targeting Women on Social Media, organized as part of DravidianLangTech@NAACL 2025. The task aimed to encourage the development of robust systems to detectabusive content targeting women in Tamil and Malayalam, two low-resource Dravidian languages. Participants were provided with annotated datasets containing abusive and nonabusive text curated from YouTube comments. We present an overview of the approaches and analyse the results of the shared task submissions. We believe the findings presented in this paper will be useful to researchers working in Dravidian language technology.
The rapid expansion of social media has facilitated communication but also enabled the spread of misogynistic memes, reinforcing gender stereotypes and toxic online environments. Detecting such content is challenging due to the multimodal nature of memes, where meaning emerges from the interplay of text and images. The Misogyny Meme Detection shared task at DravidianLangTech@NAACL 2025 focused on Tamil and Malayalam, encouraging the development of multimodal approaches. With 114 teams registered and 23 submitting predictions, participants leveraged various pretrained language models and vision models through fusion techniques. The best models achieved high macro F1 scores (0.83682 for Tamil, 0.87631 for Malayalam), highlighting the effectiveness of multimodal learning. Despite these advances, challenges such as bias in the data set, class imbalance, and cultural variations persist. Future research should refine multimodal detection methods to improve accuracy and adaptability, fostering safer and more inclusive online spaces.
The increasing prevalence of misogynistic content in online memes has raised concerns about their impact on digital discourse. The culture specific images and informal usage of text in the memes present considerable challenges for the automatic detection systems, especially in low-resource languages. While previous shared tasks have addressed misogyny detection in English and several European languages, misogynistic meme detection in the Chinese has remained largely unexplored. To address this gap, we introduced a shared task focused on binary classification of Chinese language memes as misogynistic or non-misogynistic. The task featured memes collected from the Chinese social media and annotated by native speakers. A total of 45 teams registered, with 8 teams submitting predictions from their multimodal models integrating textual and visual features through diverse fusion strategies. The best-performing system achieved a macro F1-score of 0.93035, highlighting the effectiveness of lightweight pretrained encoder fusion. This system used the Chinese BERT and DenseNet-121 for text and image feature extraction, respectively. A feedforward network was trained as a classifier using the features obtained by concatenating text and image features.
Hate speech targeting caste and migration communities is a growing concern in online platforms, particularly in linguistically diverse regions. By focusing on Tamil language text content, this task provides a unique opportunity to tackle caste or migration related hate speech detection in a low resource language Tamil, contributing to a safer digital space. We present the results and main findings of the shared task caste and migration hate speech detection. The task is a binary classification determining whether a text is caste/migration related hate speech or not. The task attracted 17 participating teams, experimenting with a wide range of methodologies from traditional machine learning to advanced multilingual transformers. The top performing system achieved a macro F1-score of 0.88105, enhancing an ensemble of fine-tuned transformer models including XLM-R and MuRIL. Our analysis highlights the effectiveness of multilingual transformers in low resource, ensemble learning, and culturally informed socio political context based techniques.
This paper presents the findings of the shared task on multimodal sentiment analysis, abusive language detection and hate speech detection in Dravidian languages. Through this shared task, researchers worldwide can submit models for three crucial social media data analysis challenges in Dravidian languages: sentiment analysis, abusive language detection, and hate speech detection. The aim is to build models for deriving fine-grained sentiment analysis from multimodal data in Tamil and Malayalam, identifying abusive and hate content from multimodal data in Tamil. Three modalities make up the multimodal data: text, audio, and video. YouTube videos were gathered to create the datasets for the tasks. Thirty-nine teams took part in the competition. However, only two teams, though, turned in their findings. The macro F1-score was used to assess the submissions
Users of social media platforms are negatively affected by the proliferation of hate or abusive content. There has been a rise in homophobic and transphobic content in recent years targeting LGBT+ individuals. The increasing levels of homophobia and transphobia online can make online platforms harmful and threatening for LGBT+ persons, potentially inhibiting equality, diversity, and inclusion. We are introducing a new dataset for three languages, namely Telugu, Kannada, and Gujarati. Additionally, we have created an expert-labeled dataset to automatically identify homophobic and transphobic content within comments collected from YouTube. We provided comprehensive annotation rules to educate annotators in this process. We collected approximately 10,000 comments from YouTube for all three languages. Marking the first dataset of these languages for this task, we also developed a baseline model with pre-trained transformers.
In this digital era, memes have become a prevalent online expression, humor, sarcasm, and social commentary. However, beneath their surface lies concerning issues such as the propagation of misogyny, gender-based bias, and harmful stereotypes. To overcome these issues, we introduced MDMD (Misogyny Detection Meme Dataset) in this paper. This article focuses on creating an annotated dataset with detailed annotation guidelines to delve into online misogyny within the Tamil and Malayalam-speaking communities. Through analyzing memes, we uncover the intricate world of gender bias and stereotypes in these communities, shedding light on their manifestations and impact. This dataset, along with its comprehensive annotation guidelines, is a valuable resource for understanding the prevalence, origins, and manifestations of misogyny in various contexts, aiding researchers, policymakers, and organizations in developing effective strategies to combat gender-based discrimination and promote equality and inclusivity. It enables a deeper understanding of the issue and provides insights that can inform strategies for cultivating a more equitable and secure online environment. This work represents a crucial step in raising awareness and addressing gender-based discrimination in the digital space.
This paper offers a detailed overview of the first shared task on “Multitask Meme Classification - Unraveling Misogynistic and Trolls in Online Memes,” organized as part of the LT-EDI@EACL 2024 conference. The task was set to classify misogynistic content and troll memes within online platforms, focusing specifically on memes in Tamil and Malayalam languages. A total of 52 teams registered for the competition, with four submitting systems for the Tamil meme classification task and three for the Malayalam task. The outcomes of this shared task are significant, providing insights into the current state of misogynistic content in digital memes and highlighting the effectiveness of various computational approaches in identifying such detrimental content. The top-performing model got a macro F1 score of 0.73 in Tamil and 0.87 in Malayalam.
We present an overview of the first shared task on “Caste and Migration Hate Speech Detection.” The shared task is organized as part of LTEDI@EACL 2024. The system must delineate between binary outcomes, ascertaining whether the text is categorized as a caste/migration hate speech or not. The dataset presented in this shared task is in Tamil, which is one of the under-resource languages. There are a total of 51 teams participated in this task. Among them, 15 teams submitted their research results for the task. To the best of our knowledge, this is the first time the shared task has been conducted on textual hate speech detection concerning caste and migration. In this study, we have conducted a systematic analysis and detailed presentation of all the contributions of the participants as well as the statistics of the dataset, which is the social media comments in Tamil language to detect hate speech. It also further goes into the details of a comprehensive analysis of the participants’ methodology and their findings.
In recent years, there has been a growing focus on Sentiment Analysis (SA) of code-mixed Dravidian languages. However, the majority of social media text in these languages is code-mixed, presenting a unique challenge. Despite this, there is currently lack of research on SA specifically tailored for code-mixed Dravidian languages, highlighting the need for further exploration and development in this domain. In this view, “Sentiment Analysis in Tamil and Tulu- DravidianLangTech” shared task at Recent Advances in Natural Language Processing (RANLP)- 2023 is organized. This shred consists two language tracks: code-mixed Tamil and Tulu and Tulu text is first ever explored in public domain for SA. We describe the task, its organization, and the submitted systems followed by the results. 57 research teams registered for the shared task and We received 27 systems each for code-mixed Tamil and Tulu texts. The performance of the systems (developed by participants) has been evaluated in terms of macro average F1 score. The top system for code-mixed Tamil and Tulu texts scored macro average F1 score of 0.32, and 0.542 respectively. The high quality and substantial quantity of submissions demonstrate a significant interest and attention in the analysis of code-mixed Dravidian languages. However, the current state of the art in this domain indicates the need for further advancements and improvements to effectively address the challenges posed by code-mixed Dravidian language SA.
Code-mixing is a word or phrase-level act of interchanging two or more languages during a conversation or in written text within a sentence. This phenomenon is widespread on social media platforms, and understanding the underlying abusive comments in a code-mixed sentence is a complex challenge. We present our system in our submission for the DravidianLangTech Shared Task on Abusive Comment Detection in Tamil and Telugu. Our approach involves building a multiclass abusive detection model that recognizes 8 different labels. The provided samples are code-mixed Tamil-English text, where Tamil is represented in romanised form. We focused on the Multiclass classification subtask, and we leveraged Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR). Our method exhibited its effectiveness in the shared task by earning the ninth rank out of all competing systems for the classification of abusive comments in the code-mixed text. Our proposed classifier achieves an impressive accuracy of 0.99 and an F1-score of 0.99 for a balanced dataset using TF-IDF with SVM. It can be used effectively to detect abusive comments in Tamil, English code-mixed text
We present an overview of the second shared task on homophobia/transphobia Detection in social media comments. Given a comment, a system must predict whether or not it contains any form of homophobia/transphobia. The shared task included five languages: English, Spanish, Tamil, Hindi, and Malayalam. The data was given for two tasks. Task A was given three labels, and Task B fine-grained seven labels. In total, 75 teams enrolled for the shared task in Codalab. For task A, 12 teams submitted systems for English, eight teams for Tamil, eight teams for Spanish, and seven teams for Hindi. For task B, nine teams submitted for English, 7 teams for Tamil, 6 teams for Malayalam. We present and analyze all submissions in this paper.
Our research in Natural Language Processing (NLP) aims to detect hate speech comments specifically targeted at the LGBTQ+ community within the YouTube platform shared task conducted by LTEDI workshop. The dataset provided by the organizers exhibited a high degree of class imbalance, and to mitigate this, we employed NLPAUG, a data augmentation library. We employed several classification methods and reported the results using recall, precision, and F1-score metrics. The classification models discussed in this paper include a Bidirectional Long Short-Term Memory (BiLSTM) model trained with Word2Vec embeddings, a BiLSTM model trained with Twitter GloVe embeddings, transformer models such as BERT, DistiBERT, RoBERTa, and XLM-RoBERTa, all of which were trained and fine-tuned. We achieved a weighted F1-score of 0.699 on the test data and secured fifth place in task B with 7 classes for the English language.
Many people may find motivation in their lives by spreading content on social media that is encouraging or hopeful. Creating an effective model that helps in accurately predicting the target class is a challenging task. The problem of Hope speech identification is dealt with in this work using machine learning and deep learning methods. This paper presents the description of the system submitted by our team(VEL) to the Hope Speech Detection for Equality, Diversity, and Inclusion(HSD-EDI) LT-EDI-RANLP 2023 shared task for the Bulgarian language. The main goal of this shared task is to identify the given text into the Hope speech or Non-Hope speech category. The proposed method used the H2O deep learning model with MPNet embeddings and achieved the second rank for the Bulgarian language with the Macro F1 score of 0.69.
We propose a new dataset for detecting non-inclusive language in sentences in English. These sentences were gathered from public sites, explaining what is inclusive and what is non-inclusive. We also extracted potentially non-inclusive keywords/phrases from the guidelines from business websites. A phrase dictionary was created by using an automatic extension with a word embedding trained on a massive corpus of general English text. In the end, a phrase dictionary was constructed by hand-editing the previous one to exclude inappropriate expansions and add the keywords from the guidelines. In a business context, the words individuals use can significantly impact the culture of inclusion and the quality of interactions with clients and prospects. Knowing the right words to avoid helps customers of different backgrounds and historically excluded groups feel included. They can make it easier to have productive, engaging, and positive communications. You can find the dictionaries, the code, and the method for making requests for the corpus at (we will release the link for data and code once the paper is accepted).
Offensive content moderation is vital in social media platforms to support healthy online discussions. However, their prevalence in code-mixed Dravidian languages is limited to classifying whole comments without identifying part of it contributing to offensiveness. Such limitation is primarily due to the lack of annotated data for offensive spans. Accordingly, in this shared task, we provide Tamil-English code-mixed social comments with offensive spans. This paper outlines the dataset so released, methods, and results of the submitted systems.
Thirumurai, also known as Panniru Thirumurai, is a collection of Tamil Shaivite poems dating back to the Hindu revival period between the 6th and the 10th century. These poems are par excellence, in both literary and musical terms. They have been composed based on the ancient, now non-existent Tamil Pann system and can be set to music. We present a large dataset containing all the Thirumurai poems and also attempt to classify the Pann and author of each poem using transformer based architectures. Our work is the first of its kind in dealing with ancient Tamil text datasets, which are severely under-resourced. We explore several Deep Learning-based techniques for solving this challenge effectively and provide essential insights into the problem and how to address it.
Homophobia and Transphobia Detection is the task of identifying homophobia, transphobia, and non-anti-LGBT+ content from the given corpus. Homophobia and transphobia are both toxic languages directed at LGBTQ+ individuals that are described as hate speech. This paper summarizes our findings on the “Homophobia and Transphobia Detection in social media comments” shared task held at LT-EDI 2022 - ACL 2022 1. This shared taskfocused on three sub-tasks for Tamil, English, and Tamil-English (code-mixed) languages. It received 10 systems for Tamil, 13 systems for English, and 11 systems for Tamil-English. The best systems for Tamil, English, and Tamil-English scored 0.570, 0.870, and 0.610, respectively, on average macro F1-score.
Hope Speech detection is the task of classifying a sentence as hope speech or non-hope speech given a corpus of sentences. Hope speech is any message or content that is positive, encouraging, reassuring, inclusive and supportive that inspires and engenders optimism in the minds of people. In contrast to identifying and censoring negative speech patterns, hope speech detection is focussed on recognising and promoting positive speech patterns online. In this paper, we report an overview of the findings and results from the shared task on hope speech detection for Tamil, Malayalam, Kannada, English and Spanish languages conducted in the second workshop on Language Technology for Equality, Diversity and Inclusion (LT-EDI-2022) organised as a part of ACL 2022. The participants were provided with annotated training & development datasets and unlabelled test datasets in all the five languages. The goal of the shared task is to classify the given sentences into one of the two hope speech classes. The performances of the systems submitted by the participants were evaluated in terms of micro-F1 score and weighted-F1 score. The datasets for this challenge are openly available
Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification in under-resourced Tamil, Malayalam and Kannada languages are essential. As the user-generated content is more code-mixed and not well studied for under-resourced languages, it is imperative to create resources and conduct benchmarking studies to encourage research in under-resourced Dravidian languages. We created a shared task on offensive language detection in Dravidian languages. We summarize here the dataset for this challenge which are openly available at https://competitions.codalab.org/competitions/27654, and present an overview of the methods and the results of the competing systems.
This paper describes the IIITK’s team submissions to the hope speech detection for equality, diversity and inclusion in Dravidian languages shared task organized by LT-EDI 2021 workshop@EACL 2021. Our best configurations for the shared tasks achieve weighted F1 scores of 0.60 for Tamil, 0.83 for Malayalam, and 0.93 for English. We have secured ranks of 4, 3, 2 in Tamil, Malayalam and English respectively.