pdf
bib
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi
|
Ruba Priyadharshini
|
Anand Kumar Madasamy
|
Sajeetha Thavareesan
|
Elizabeth Sherly
|
Saranya Rajiakodi
|
Balasubramanian Palani
|
Malliga Subramanian
|
Subalalitha Cn
|
Dhivya Chinnappa
pdf
bib
abs
Incepto@DravidianLangTech 2025: Detecting Abusive Tamil and Malayalam Text Targeting Women on YouTube
Luxshan Thavarasa
|
Sivasuthan Sukumar
|
Jubeerathan Thevakumar
This study introduces a novel multilingual model designed to effectively address the challenges of detecting abusive content in low-resource, code-mixed languages, where limited data availability and the interplay of mixed languages, leading to complex linguistic phenomena, create significant hurdles in developing robust machine learning models. By leveraging transfer learning techniques and employing multi-head attention mechanisms, our model demonstrates impressive performance in detecting abusive content in both Tamil and Malayalam datasets. On the Tamil dataset, our team achieved a macro F1 score of 0.7864, while for the Malayalam dataset, a macro F1 score of 0.7058 was attained. These results highlight the effectiveness of our multilingual approach, delivering strong performance in Tamil and competitive results in Malayalam.
pdf
bib
abs
Eureka-CIOL@DravidianLangTech 2025: Using Customized BERTs for Sentiment Analysis of Tamil Political Comments
Enjamamul Haque Eram
|
Anisha Ahmed
|
Sabrina Afroz Mitu
|
Azmine Toushik Wasi
Sentiment analysis on social media platforms plays a crucial role in understanding public opinion and the decision-making process on political matters. As a significant number of individuals express their views on social media, analyzing these opinions is essential for monitoring political trends and assessing voter sentiment. However, sentiment analysis for low-resource languages, such as Tamil, presents considerable challenges due to the limited availability of annotated datasets and linguistic complexities. To address this gap, we utilize a novel dataset encompassing seven sentiment classes, offering a unique opportunity to explore sentiment variations in Tamil political discourse. In this study, we evaluate multiple pre-trained models from the Hugging Face library and experiment with various hyperparameter configurations to optimize model performance. Our findings aim to contribute to the development of more effective sentiment analysis tools tailored for low-resource languages, ultimately empowering Tamil-speaking communities by providing deeper insights into their political sentiments. Our full experimental codebase is publicly available at: ciol-researchlab/NAACL25-Eureka-Sentiment-Analysis-Tamil
pdf
bib
abs
Akatsuki-CIOL@DravidianLangTech 2025: Ensemble-Based Approach Using Pre-Trained Models for Fake News Detection in Dravidian Languages
Mahfuz Ahmed Anik
|
Md. Iqramul Hoque
|
Wahid Faisal
|
Azmine Toushik Wasi
|
Md Manjurul Ahsan
The rapid spread of fake news on social media poses significant challenges, particularly for low-resource languages like Malayalam. The accessibility of social platforms accelerates misinformation, leading to societal polarization and poor decision-making. Detecting fake news in Malayalam is complex due to its linguistic diversity, code-mixing, and dialectal variations, compounded by the lack of large labeled datasets and tailored models. To address these challenges, we developed a fine-tuned transformer-based model for binary and multiclass fake news detection. The binary classifier achieved a macro F1 score of 0.814, while the multiclass model, using multimodal embeddings, achieved a score of 0.1978. Our system ranked 14th and 11th in the shared task competition, highlighting the need for specialized techniques in underrepresented languages. Our full experimental codebase is publicly available at: ciol-researchlab/NAACL25-Akatsuki-Fake-News-Detection.
pdf
bib
abs
RMKMavericks@DravidianLangTech 2025: Tackling Abusive Tamil and Malayalam Text Targeting Women: A Linguistic Approach
Sandra Johnson
|
Boomika E
|
Lahari P
Social media abuse of women is a widespread problem, especially in regional languages like Tamil and Malayalam, where there are few tools for automated identification. The use of machine learning methods to detect abusive messages in several languages is examined in this work. An external dataset was used to train a Support Vector Machine (SVM) model for Tamil, which produced an F1 score of 0.6196. Using the given dataset, a Multinomial Naive Bayes (MNB) model was trained for Malayalam, obtaining an F1 score of 0.6484. Both models processed and analyzed textual input efficiently by using TF-IDF vectorization for feature extraction. This method shows the ability to handle the linguistic diversity and complexity of abusive language identification by utilizing language-specific datasets and customized algorithms. The results highlight how crucial it is to use focused machine learning techniques to make online spaces safer for women, especially for speakers of minority languages.
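The TF-IDF feature-extraction step that both of these classical pipelines rely on can be sketched from scratch. This is a minimal illustration only, not the authors' actual pipeline; the smoothed-IDF formula mirrors common library defaults, and the toy documents are invented:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors (smoothed IDF) for pre-tokenized docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    # Smoothed inverse document frequency, as in common implementations.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vecs

# Toy example: terms shared by every document get the lowest IDF weight.
docs = [
    "you are wonderful".split(),
    "you are awful and stupid".split(),
]
vecs = tfidf_vectors(docs)
```

The resulting sparse dictionaries would then be fed to an SVM or Naive Bayes classifier, as in the paper.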
pdf
bib
abs
RMKMavericks@DravidianLangTech 2025: Emotion Mining in Tamil and Tulu Code-Mixed Text: Challenges and Insights
Gladiss Merlin N.r
|
Boomika E
|
Lahari P
Sentiment analysis in code-mixed social media comments written in Tamil and Tulu presents unique challenges due to grammatical inconsistencies, code-switching, and the use of non-native scripts. To address these complexities, we employ pre-processing techniques for text cleaning and evaluate machine learning models tailored for sentiment detection. Traditional machine learning methods combined with feature extraction strategies, such as TF-IDF, are utilized. While logistic regression demonstrated reasonable performance on the Tamil dataset, achieving a macro F1 score of 0.44, support vector machines (SVM) outperformed logistic regression on the Tulu dataset with a macro F1 score of 0.54. These results demonstrate the effectiveness of traditional approaches, particularly SVM, in handling low-resource, multilingual data, while also highlighting the need for further refinement to improve performance across underrepresented sentiment classes.
pdf
bib
abs
JAS@DravidianLangTech 2025: Abusive Tamil Text targeting Women on Social Media
B Saathvik
|
Janeshvar Sivakumar
|
Thenmozhi Durairaj
This paper presents our submission for Abusive Comment Detection in Tamil - DravidianLangTech@NAACL 2025. The aim is to classify whether a given comment is abusive towards women. Google’s MuRIL (Khanuja et al., 2021), a transformer-based multilingual model, is fine-tuned using the provided dataset to build the classification model. The dataset is preprocessed, tokenised, and formatted for model training. The model is trained and evaluated using accuracy, F1-score, precision, and recall. Our approach achieved an evaluation accuracy of 77.76% and an F1-score of 77.65%. The lack of large, high-quality datasets for low-resource languages has also been acknowledged.
pdf
bib
abs
Team-Risers@DravidianLangTech 2025: AI-Generated Product Review Detection in Dravidian Languages Using Transformer-Based Embeddings
Sai Sathvik
|
Muralidhar Palli
|
Keerthana NNL
|
Balasubramanian Palani
|
Jobin Jose
|
Siranjeevi Rajamanickam
Online product reviews influence customer choices and company reputations. However, companies can counter negative reviews by generating fake reviews that portray their products positively. These fake reviews lead to legal disputes and concerns, particularly because AI detection tools are limited in low-resource languages such as Tamil and Malayalam. To address this, we use machine learning and deep learning techniques to identify AI-generated reviews. We utilize Tamil BERT and Malayalam BERT in the embedding layer to extract contextual features. These features are sent to a Feedforward Neural Network (FFN) with softmax to classify reviews as AI-generated or not. The performance of the model is evaluated on the dataset. The results show that the transformer-based embedding achieves an accuracy of 95.68% on Tamil data and 88.75% on Malayalam data.
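The final softmax step that turns the FFN's output logits into class probabilities can be sketched as follows; the logit values and the two-class label ordering here are hypothetical, chosen only to illustrate the mechanism:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical two-class logits: index 0 = human-written, index 1 = AI-generated.
probs = softmax([1.2, 3.4])
label = "AI-generated" if probs[1] > probs[0] else "human-written"
```

In a real pipeline the logits would come from the FFN applied to the BERT embedding of the review.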
pdf
bib
abs
NLPopsCIOL@DravidianLangTech 2025: Classification of Abusive Tamil and Malayalam Text Targeting Women Using Pre-trained Models
Abdullah Al Nahian
|
Mst Rafia Islam
|
Azmine Toushik Wasi
|
Md Manjurul Ahsan
Hate speech detection in multilingual and code-mixed contexts remains a significant challenge due to linguistic diversity and overlapping syntactic structures. This paper presents a study on the detection of hate speech in Tamil and Malayalam using transformer-based models. Our goal is to address underfitting and develop effective models for hate speech classification. We evaluate several pre-trained models, including MuRIL and XLM-RoBERTa, and show that fine-tuning is crucial for better performance. The test results show a Macro-F1 score of 0.7039 for Tamil and 0.6402 for Malayalam, highlighting the promise of these models with further improvements in fine-tuning. We also discuss data preprocessing techniques, model implementations, and experimental findings. Our full experimental codebase is publicly available at: github.com/ciol-researchlab/NAACL25-NLPops-Classification-Abusive-Text.
pdf
bib
abs
AiMNLP@DravidianLangTech 2025: Unmask It! AI-Generated Product Review Detection in Dravidian Languages
Somsubhra De
|
Advait Vats
The rise of Generative AI has led to a surge in AI-generated reviews, often posing a serious threat to the credibility of online platforms. Reviews serve as the primary source of information about products and services. Authentic reviews play a vital role in consumer decision-making. The presence of fabricated content misleads consumers, undermines trust and facilitates potential fraud in digital marketplaces. This study focuses on detecting AI-generated product reviews in Tamil and Malayalam, two low-resource languages where research in this domain is relatively under-explored. We worked on a range of approaches - from traditional machine learning methods to advanced transformer-based models such as Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa and Malayalam-BERT. Our findings highlight the effectiveness of leveraging the state-of-the-art transformers in accurately identifying AI-generated content, demonstrating the potential in enhancing the detection of fake reviews in low-resource language settings.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Fake News Detection in Dravidian Languages Using Transliteration-Aware XLM-RoBERTa and Transformer Encoder-Decoder
Durga Prasad Manukonda
|
Rohith Gowtham Kodali
This study addresses the challenge of fake news detection in code-mixed and transliterated text, focusing on a multilingual setting with significant linguistic variability. A novel approach is proposed, leveraging a fine-tuned multilingual transformer model trained using Masked Language Modeling on a dataset that includes original, fully transliterated, and partially transliterated text. The fine-tuned embeddings are integrated into a custom transformer classifier designed to capture complex dependencies in multilingual sequences. The system achieves state-of-the-art performance, demonstrating the effectiveness of combining transliteration-aware fine-tuning with robust transformer architectures to handle code-mixed and resource-scarce text, providing a scalable solution for multilingual natural language processing tasks.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Fake News Detection in Dravidian Languages Using Transliteration-Aware XLM-RoBERTa and Attention-BiLSTM
Rohith Gowtham Kodali
|
Durga Prasad Manukonda
This research introduces an innovative Attention BiLSTM-XLM-RoBERTa model for tackling the challenge of fake news detection in Malayalam datasets. By fine-tuning XLM-RoBERTa with Masked Language Modeling (MLM) on transliteration-aware data, the model effectively bridges linguistic and script diversity, seamlessly integrating native, Romanized, and mixed-script text. Although most of the training data is monolingual, the proposed approach demonstrates robust performance in handling diverse script variations. Achieving a macro F1-score of 0.5775 and securing top rankings in the shared task, this work highlights the potential of multilingual models in addressing resource-scarce language challenges and sets a foundation for future advancements in fake news detection.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Multimodal Hate Speech Detection in Malayalam Using Attention-Driven BiLSTM, Malayalam-Topic-BERT, and Fine-Tuned Wav2Vec 2.0
Durga Prasad Manukonda
|
Rohith Gowtham Kodali
|
Daniel Iglesias
This research presents a robust multimodal framework for hate speech detection in Malayalam, combining fine-tuned Wav2Vec 2.0, Malayalam-Doc-Topic-BERT, and an Attention-Driven BiLSTM architecture. The proposed approach effectively integrates acoustic and textual features, achieving a macro F1-score of 0.84 on the Malayalam test set. Fine-tuning Wav2Vec 2.0 on Malayalam speech data and leveraging Malayalam-Doc-Topic-BERT significantly improved performance over prior methods using openly available models. The results highlight the potential of language-specific models and advanced multimodal fusion techniques for addressing nuanced hate speech categories, setting the stage for future work on Dravidian languages like Tamil and Telugu.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Detecting AI-Generated Product Reviews in Dravidian Languages Using XLM-RoBERTa and Attention-BiLSTM
Rohith Gowtham Kodali
|
Durga Prasad Manukonda
|
Maharajan Pannakkaran
This study presents a hybrid model integrating TamilXLM-RoBERTa and MalayalamXLM-RoBERTa with BiLSTM and attention mechanisms to classify AI-generated and human-written product reviews in Tamil and Malayalam. The model employs a transliteration-based fine-tuning strategy, effectively handling native, Romanized, and mixed-script text. Despite being trained on a relatively small portion of data, our approach demonstrates strong performance in distinguishing AI-generated content, achieving competitive macro F1 scores in the DravidianLangTech 2025 shared task. The proposed method showcases the effectiveness of multilingual transformers and hybrid architectures in tackling low-resource language challenges.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media Using XLM-RoBERTa and Attention-BiLSTM
Rohith Gowtham Kodali
|
Durga Prasad Manukonda
|
Maharajan Pannakkaran
This research investigates abusive comment detection in Tamil and Malayalam, focusing on code-mixed, multilingual social media text. A hybrid Attention BiLSTM-XLM-RoBERTa model was utilized, combining fine-tuned embeddings, sequential dependencies, and attention mechanisms. Despite computational constraints limiting fine-tuning to a subset of the AI4Bharath dataset, the model achieved competitive macro F1-scores, ranking 6th for both Tamil and Malayalam datasets with minor performance differences. The results emphasize the potential of multilingual transformers and the need for further advancements, particularly in addressing linguistic diversity, transliteration complexity, and computational limitations.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Multimodal Misogyny Meme Detection in Low-Resource Dravidian Languages Using Transliteration-Aware XLM-RoBERTa, ResNet-50, and Attention-BiLSTM
Durga Prasad Manukonda
|
Rohith Gowtham Kodali
Detecting misogyny in memes is challenging due to their multimodal nature, especially in low-resource languages like Tamil and Malayalam. This paper presents our work in the Misogyny Meme Detection task, utilizing both textual and visual features. We propose an Attention-Driven BiLSTM-XLM-RoBERTa-ResNet model, combining a transliteration-aware fine-tuned XLM-RoBERTa for text analysis and ResNet-50 for image feature extraction. Our model achieved Macro-F1 scores of 0.8805 for Malayalam and 0.8081 for Tamil, demonstrating competitive performance. However, challenges such as class imbalance and domain-specific image representation persist. Our findings highlight the need for better dataset curation, task-specific fine-tuning, and advanced fusion techniques to enhance multimodal hate speech detection in Dravidian languages.
pdf
bib
abs
byteSizedLLM@DravidianLangTech 2025: Sentiment Analysis in Tamil Using Transliteration-Aware XLM-RoBERTa and Attention-BiLSTM
Durga Prasad Manukonda
|
Rohith Gowtham Kodali
This study investigates sentiment analysis in code-mixed Tamil-English text using an Attention BiLSTM-XLM-RoBERTa model, combining multilingual embeddings with sequential context modeling to enhance classification performance. The model was fine-tuned using masked language modeling and trained with an attention-based BiLSTM classifier to capture sentiment patterns in transliterated and informal text. Despite computational constraints limiting pretraining, the approach achieved a macro F1 of 0.5036 and ranked first in the competition. The model performed best on the Positive class, while Mixed Feelings and Unknown State showed lower recall due to class imbalance and ambiguity. Error analysis reveals challenges in handling non-standard transliterations, sentiment shifts, and informal language variations in social media text. These findings demonstrate the effectiveness of transformer-based multilingual embeddings and sequential modeling for sentiment classification in code-mixed text.
pdf
bib
abs
SSNCSE@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Sreeja K
|
Bharathi B
Hate speech detection is a serious challenge due to the diversity of digital media communication, particularly in low-resource languages. This research addresses multimodal hate speech detection by incorporating both textual and audio modalities. On social media platforms, hate speech is conveyed not only through text but also through audio, which may further amplify harmful content. To manage this issue, we provide a multiclass classification model that leverages both text and audio features to detect and categorize hate speech in low-resource languages. The model uses machine learning models for text analysis and audio processing, allowing it to efficiently capture the complex relationships between the two modalities. A class-weighting mechanism is employed to avoid overfitting, and final predictions are obtained using majority-vote fusion. Performance is measured using the macro average F1 score metric. The model achieved F1-scores of 0.59, 0.52, and 0.33 for Tamil, Malayalam, and Telugu, respectively.
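The majority-vote fusion step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code, and the sample labels are invented:

```python
from collections import Counter

def majority_fusion(*modality_preds):
    """Fuse per-modality label predictions by majority vote, sample by sample."""
    fused = []
    for labels in zip(*modality_preds):
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

# Hypothetical per-sample labels from three classifiers (e.g. text, audio, ensemble member).
text_preds  = ["hate", "none", "hate"]
audio_preds = ["hate", "hate", "none"]
extra_preds = ["none", "none", "hate"]
fused = majority_fusion(text_preds, audio_preds, extra_preds)
```

With an odd number of voters, ties cannot occur in binary settings; for multiclass ties, `Counter.most_common` falls back to insertion order, so a deterministic tie-breaking rule may be worth adding in practice.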
pdf
bib
abs
Bridging Linguistic Complexity: Sentiment Analysis of Tamil Code-Mixed Text Using Meta-Model
Anusha M D Gowda
|
Deepthi Vikram
|
Parameshwar R Hegde
Sentiment analysis in code-mixed languages poses significant challenges due to the complex nature of mixed-language text. This study explores sentiment analysis on Tamil code-mixed text using deep learning models such as Long Short-Term Memory (LSTM), hybrid models like Convolutional Neural Network (CNN) + Gated Recurrent Unit (GRU) and LSTM + GRU, along with meta-models including Logistic Regression, Random Forest, and Decision Tree. The LSTM+GRU hybrid model achieved an accuracy of 0.31, while the CNN+GRU hybrid model reached 0.28. The Random Forest meta-model demonstrated exceptional performance on the development set with an accuracy of 0.99. However, its performance dropped significantly on the test set, achieving an accuracy of 0.1333. The study results emphasize the potential of meta-model-based classification for improving performance in NLP tasks.
pdf
bib
abs
YenCS@DravidianLangTech 2025: Integrating Hybrid Architectures for Fake News Detection in Low-Resource Dravidian Languages
Anusha M D Gowda
|
Parameshwar R Hegde
Detecting fake news in under-resourced Dravidian languages is a demanding task due to the scarcity of annotated datasets and the intricate nature of code-mixed text. This study tackles these issues by employing advanced machine learning techniques for two key classification tasks. The first task involves binary classification, achieving a macro-average F1-score of 0.792 using a hybrid fusion model that integrates a Bidirectional Recurrent Neural Network (Bi-RNN) and a Long Short-Term Memory (LSTM)-Recurrent Neural Network (RNN) with weighted averaging. The second task focuses on fine-grained classification, categorizing news, where an LSTM-GRU hybrid model attained a macro-average F1-score of 0.26. These findings highlight the effectiveness of hybrid models in improving fake news detection for under-resourced languages. Additionally, this study provides a foundational framework that can be adapted to address similar challenges in other under-resourced languages, emphasizing the need for further research in this area.
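Weighted averaging of two models' probability outputs, as used in the binary-classification fusion above, can be sketched as follows. The probability values and the 0.6/0.4 weights are hypothetical, chosen for illustration; the paper does not report its fusion weights:

```python
def weighted_average(probs_a, probs_b, w_a=0.6, w_b=0.4):
    """Blend two models' per-sample class-probability vectors."""
    return [
        [w_a * pa + w_b * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]

# Hypothetical [P(real), P(fake)] outputs from the two hybrid models, two samples.
bi_rnn_probs   = [[0.7, 0.3], [0.2, 0.8]]
lstm_rnn_probs = [[0.4, 0.6], [0.1, 0.9]]
fused = weighted_average(bi_rnn_probs, lstm_rnn_probs)
preds = ["fake" if p[1] > p[0] else "real" for p in fused]
```

As long as the weights sum to 1 and each input row is a probability distribution, each fused row remains a valid distribution.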
pdf
bib
abs
Overview of the Shared Task on Multimodal Hate Speech Detection in Dravidian languages: DravidianLangTech@NAACL 2025
Jyothish Lal G
|
Premjith B
|
Bharathi Raja Chakravarthi
|
Saranya Rajiakodi
|
Bharathi B
|
Rajeswari Natarajan
|
Ratnavel Rajalakshmi
The detection of hate speech on social media platforms is crucial due to its adverse impact on mental health, social harmony, and online safety. This paper presents the overview of the shared task on Multimodal Hate Speech Detection in Dravidian Languages organized as part of DravidianLangTech@NAACL 2025. The task emphasizes detecting hate speech in social media content that combines speech and text. Here, we focus on three low-resource Dravidian languages: Malayalam, Tamil, and Telugu. Participants were required to classify hate speech in three sub-tasks, each corresponding to one of these languages. The dataset was curated by collecting speech and corresponding text from YouTube videos. Various machine learning and deep learning-based models, including transformer-based architectures and multimodal frameworks, were employed by the participants. The submissions were evaluated using the macro F1 score. Experimental results underline the potential of multimodal approaches in advancing hate speech detection for low-resource languages. Team SSNTrio achieved the highest F1 scores of 0.7511 and 0.7332 in Malayalam and Tamil, respectively. Team lowes scored the best F1 score of 0.3817 in the Telugu sub-task.
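The macro F1 score used to rank submissions is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. A minimal from-scratch sketch (the toy labels are invented for illustration):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["hate", "none", "hate", "none"]
y_pred = ["hate", "hate", "hate", "none"]
score = macro_f1(y_true, y_pred)
```

Here the "hate" class scores F1 = 0.8 and "none" scores 2/3, so the macro average is about 0.733, lower than plain accuracy (0.75 is not, in general, equal to it).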
pdf
bib
abs
Overview of the Shared Task on Detecting AI Generated Product Reviews in Dravidian Languages: DravidianLangTech@NAACL 2025
Premjith B
|
Nandhini Kumaresh
|
Bharathi Raja Chakravarthi
|
Thenmozhi Durairaj
|
Balasubramanian Palani
|
Sajeetha Thavareesan
|
Prasanna Kumar Kumaresan
The detection of AI-generated product reviews is critical due to the increased use of large language models (LLMs) and their capability to generate convincing sentences. AI-generated reviews can affect consumers and businesses, as such reviews influence trust and decision-making. This paper presents the overview of the shared task on “Detecting AI-generated product reviews in Dravidian Languages” organized as part of DravidianLangTech@NAACL 2025. This task involves two subtasks—one in Malayalam and another in Tamil, both of which are binary classifications where a review is to be classified as human-generated or AI-generated. The dataset was curated by collecting comments from YouTube videos. Various machine learning and deep learning-based models ranging from SVM to transformer-based architectures were employed by the participants.
pdf
bib
abs
Girma@DravidianLangTech 2025: Detecting AI Generated Product Reviews
Girma Yohannis Bade
|
Muhammad Tayyab Zamir
|
Olga Kolesnikova
|
José Luis Oropeza
|
Grigori Sidorov
|
Alexander Gelbukh
The increasing prevalence of AI-generated content, including fake product reviews, poses significant challenges in maintaining authenticity and trust in e-commerce systems. While much work has focused on detecting such reviews in high-resource languages, limited attention has been given to low-resource languages like Malayalam and Tamil. This study aims to address this gap by developing a robust framework to identify AI-generated product reviews in these languages. We explore a BERT-based approach for this task. Our methodology involves fine-tuning a BERT-based model specifically on Malayalam and Tamil datasets. The experiments are conducted using labeled datasets that contain a mix of human-written and AI-generated reviews. Performance is evaluated using the macro F1 score. The results show that the BERT-based model achieved a macro F1 score of 0.6394 for Tamil and 0.8849 for Malayalam, indicating that it performs significantly better for Malayalam than for Tamil by leveraging its ability to capture the complex linguistic features of these languages. Finally, we release the source code of the implementation in the GitHub repository: AI-Generated-Product-Review-Code
pdf
bib
abs
Beyond_Tech@DravidianLangTech 2025: Political Multiclass Sentiment Analysis using Machine Learning and Neural Network
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Sanjai R
|
Mohammed Sameer
|
Motheeswaran K
Research on political sentiment is essential for comprehending public opinion in the digital age, as social media and news platforms are often the sites of discussions. To categorize political remarks into sentiments like positive, negative, neutral, opinionated, substantiated, and sarcastic, this study offers a multiclass sentiment analysis approach. We trained models, such as Random Forest and a Feedforward Neural Network, after preprocessing and feature extraction from a large dataset of political texts using Natural Language Processing approaches. The Random Forest model, which was great at identifying more complex attitudes like sarcasm and opinionated utterances, had the greatest accuracy of 84%, followed closely by the Feedforward Neural Network model, which had 83%. These results highlight how well political discourse can be analyzed by combining deep learning and traditional machine learning techniques. There is also room for improvement by adding external metadata and using sophisticated models like BERT for better sentiment classification.
pdf
bib
abs
Misogynistic Meme Detection in Dravidian Languages Using Kolmogorov Arnold-based Networks
Manasha Arunachalam
|
Navneet Krishna Chukka
|
Harish Vijay V
|
Premjith B
|
Bharathi Raja Chakravarthi
The prevalence of misogynistic content online poses significant challenges to ensuring a safe and inclusive digital space for women. This study presents a pipeline to classify online memes as misogynistic or non-misogynistic. The pipeline combines contextual image embeddings generated using the Vision Transformer Encoder (ViTE) model with text embeddings extracted from the memes using ModernBERT. These multimodal embeddings were fused and trained using three advanced types of Kolmogorov-Arnold Networks (KAN): PyKAN, FastKAN, and Chebyshev KAN. The models were evaluated based on their F1 scores, demonstrating their effectiveness in addressing this issue. This research marks an important step towards reducing offensive online content, promoting safer and more respectful interactions in the digital world.
pdf
bib
abs
HTMS@DravidianLangTech 2025: Fusing TF-IDF and BERT with Dimensionality Reduction for Abusive Language Detection in Tamil and Malayalam
Bachu Naga Sri Harini
|
Kankipati Venkata Meghana
|
Kondakindi Supriya
|
Tara Samiksha
|
Premjith B
Detecting abusive and similarly toxic content posted on a social media platform is challenging due to the complexities of the language, data imbalance, and the code-mixed nature of the text. In this paper, we present our submissions for the shared task on abusive Tamil and Malayalam texts targeting women on social media—DravidianLangTech@NAACL 2025. We propose a hybrid embedding model that integrates embeddings generated using term frequency-inverse document frequency (TF-IDF) and BERT. To reconcile the difference in embedding dimensions, we applied a dimensionality reduction method to the TF-IDF embeddings. We submitted two more runs to the shared task, which involve a model based on TF-IDF embedding and another based on BERT-based embedding. The code for the submissions is available at https://github.com/Tarrruh/NLP_HTMS.
pdf
bib
abs
Team_Catalysts@DravidianLangTech 2025: Leveraging Political Sentiment Analysis using Machine Learning Techniques for Classifying Tamil Tweets
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Subhadevi K
|
Sowbharanika Janani Sivakumar
|
Rahul K
This work proposed a methodology for assessing political sentiments in Tamil tweets using machine learning models. The approach addressed linguistic challenges in Tamil text, including cleaning, normalization, tokenization, and class imbalance, through a robust preprocessing pipeline. Various models, including Random Forest, Logistic Regression, and CatBoost, were applied, with Random Forest achieving a macro F1-score of 0.2933 and securing 8th rank among 153 participants in the Codalab competition. This accomplishment highlights the effectiveness of machine learning models in handling the complexities of multilingual, code-mixed, and unstructured data in Tamil political discourse. The study also emphasized the importance of tailored preprocessing techniques to improve model accuracy and performance. It demonstrated the potential of computational linguistics and machine learning in understanding political discourse in low-resource languages like Tamil, contributing to advancements in regional sentiment analysis.
pdf
bib
abs
InnovationEngineers@DravidianLangTech 2025: Enhanced CNN Models for Detecting Misogyny in Tamil Memes Using Image and Text Classification
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Pooja Sree M
|
Palanimurugan V
|
Roshini Priya K
The rise of misogynistic memes on social media posed challenges to civil discourse. This paper aimed to detect misogyny in Dravidian language memes using a multimodal deep learning approach. We integrated Bidirectional Encoder Representations from Transformers (BERT), Long Short-Term Memory (LSTM), EfficientNet, and a Vision Language Model (VLM) to analyze textual and visual information. EfficientNet extracted image features, LSTM captured sequential text patterns, and BERT learned language-specific embeddings. Among these, VLM achieved the highest accuracy of 85.0% and an F1-score of 70.8, effectively capturing visual-textual relationships. Validated on a curated dataset, our method outperformed baselines in precision, recall, and F1-score. Our approach ranked 12th out of 118 participants for the Tamil language, highlighting its competitive performance. This research emphasizes the importance of multimodal models in detecting harmful content. Future work can explore improved feature fusion techniques to enhance classification accuracy.
pdf
bib
abs
MysticCIOL@DravidianLangTech 2025: A Hybrid Framework for Sentiment Analysis in Tamil and Tulu Using Fine-Tuned SBERT Embeddings and Custom MLP Architectures
Minhaz Chowdhury
|
Arnab Laskar
|
Taj Ahmad
|
Azmine Toushik Wasi
Sentiment analysis is a crucial NLP task used to analyze opinions in various domains, including marketing, politics, and social media. While transformer-based models like BERT and SBERT have significantly improved sentiment classification, their effectiveness in low-resource languages remains limited. Tamil and Tulu, despite their widespread use, suffer from data scarcity, dialectal variations, and code-mixing challenges, making sentiment analysis difficult. Existing methods rely on traditional classifiers or word embeddings, which struggle to generalize in these settings. To address this, we propose a hybrid framework that integrates fine-tuned SBERT embeddings with a Multi-Layer Perceptron (MLP) classifier, enhancing contextual representation and classification robustness. Our framework achieves validation F1-scores of 0.4218 for Tamil and 0.3935 for Tulu and test F1-scores of 0.4299 in Tamil and 0.1546 on Tulu, demonstrating its effectiveness. This research provides a scalable solution for sentiment classification in low-resource languages, with future improvements planned through data augmentation and transfer learning. Our full experimental codebase is publicly available at: github.com/ciol-researchlab/NAACL25-Mystic-Tamil-Sentiment-Analysis.
pdf
bib
abs
KEC_AI_DATA_DRIFTERS@DravidianLangTech 2025: Fake News Detection in Dravidian Languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Vishali K S
|
Priyanka B
|
Naveen Kumar K
Detecting fake news in Malayalam poses significant challenges due to linguistic diversity, code-mixing, and the limited availability of structured datasets. We participated in the Fake News Detection in Dravidian Languages shared task, classifying news and social media posts into binary and multi-class categories. Our experiments used traditional ML models: Support Vector Machine (SVM), Random Forest, Logistic Regression, and Naive Bayes, as well as transfer learning models: Multilingual BERT (mBERT) and XLNet. In binary classification, SVM achieved the highest macro-F1 score of 0.97, while in multi-class classification it also outperformed the other models with a macro-F1 score of 0.98. Random Forest ranked second in both tasks. Despite their advanced capabilities, mBERT and XLNet exhibited lower precision due to data limitations. Our approach enhances fake news detection and NLP solutions for low-resource languages.
pdf
bib
abs
KECEmpower@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Malliga Subramanian
|
Kogilavani Shanmugavadivel
|
Indhuja V S
|
Kowshik P
|
Jayasurya S
The detection of abusive text targeting women, especially in Dravidian languages like Tamil and Malayalam, presents a unique challenge due to linguistic complexities and code-mixing on social media. This paper evaluates machine learning models such as Support Vector Machines (SVM), Logistic Regression (LR), and Random Forest Classifiers (RFC) for identifying abusive content. Code-mixed datasets sourced from platforms like YouTube are used to train and test the models. Performance is evaluated using accuracy, precision, recall, and F1-score metrics. Our findings show that SVM outperforms the other classifiers in accuracy and recall. However, challenges persist in detecting implicit abuse and addressing informal, culturally nuanced language. Future work will explore transformer-based models like BERT for better context understanding, along with data augmentation techniques to enhance model performance. Additionally, efforts will focus on expanding labeled datasets to improve abuse detection in these low-resource languages.
pdf
bib
abs
KEC_AI_GRYFFINDOR@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
ShahidKhan S
|
Shri Sashmitha.s
|
Yashica S
It is difficult to detect hate speech in code-mixed Dravidian languages because the data is multilingual and unstructured. In this research, we took part in the shared task to detect hate speech in text and audio data for Tamil, Malayalam, and Telugu. We tested different machine learning and deep learning models such as Logistic Regression, Ridge Classifier, Random Forest, and CNN. For Tamil, Logistic Regression gave the best macro-F1 score of 0.97 for text, whereas Ridge Classifier was the best for audio with a score of 0.75. For Malayalam, Random Forest gave the best F1-score of 0.97 for text, and CNN was the best for audio (F1-score: 0.69). For Telugu, Ridge Classifier gave the best F1-score of 0.89 for text, whereas CNN was the best for audio (F1-score: 0.87). Our findings show that a multimodal solution efficiently tackles the intricacy of hate speech detection in Dravidian languages. In this shared task, out of 145 teams, we attained the 12th rank for Tamil and the 7th rank for Malayalam and Telugu.
pdf
bib
abs
KECLinguAIsts@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
Malliga Subramanian
|
Rojitha R
|
Mithun Chakravarthy Y
|
Renusri R V
|
Kogilavani Shanmugavadivel
With the surge of AI-generated content in online spaces, ensuring the authenticity of product reviews has become a critical challenge. This paper addresses the task of detecting AI-generated product reviews in Dravidian languages, specifically Tamil and Malayalam, which present unique hurdles due to their complex morphology, rich syntactic structures, and code-mixed nature. We introduce a novel methodology combining machine learning classifiers with advanced multilingual transformer models to identify AI-generated reviews. Our approach not only accounts for the linguistic intricacies of these languages but also leverages domain specific datasets to improve detection accuracy. For Tamil, we evaluate Logistic Regression, Random Forest, and XGBoost, while for Malayalam, we explore Logistic Regression, Multinomial Naive Bayes (MNB), and Support Vector Machines (SVM). Transformer based models significantly outperform these traditional classifiers, demonstrating superior performance across multiple metrics.
pdf
bib
abs
Dll5143@DravidianLangTech 2025: Majority Voting-Based Framework for Misogyny Meme Detection in Tamil and Malayalam
Sarbajeet Pattanaik
|
Ashok Yadav
|
Vrijendra Singh
Misogyny memes pose a significant challenge on social networks, particularly in Dravidian-scripted languages, where subtle expressions can propagate harmful narratives against women. This paper presents our approach for the “Shared Task on Misogyny Meme Detection,” organized as part of DravidianLangTech@NAACL 2025, focusing on misogyny meme detection in Tamil and Malayalam. To tackle this problem, we proposed a multi-model framework that integrates three distinct models: M1 (ResNet-50 + google/muril-large-cased), M2 (openai/clip-vit-base-patch32 + ai4bharat/indic-bert), and M3 (ResNet-50 + ai4bharat/indic-bert). The final classification is determined using a majority voting mechanism, ensuring robustness by leveraging the complementary strengths of these models. This approach enhances classification performance by reducing biases and improving generalization. Our model achieved an F1 score of 0.77 for Tamil, significantly improving misogyny detection in the language. For Malayalam, the framework achieved an F1 score of 0.84, demonstrating strong performance. Overall, our method ranked 5th in Tamil and 4th in Malayalam, highlighting its competitive effectiveness in misogyny meme detection.
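The majority-voting fusion described in this abstract amounts to taking, per meme, the label most of the three models agree on. A minimal stdlib sketch, not the authors' implementation; the prediction lists are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-model label predictions by majority vote.

    `predictions` is a list of prediction lists, one per model
    (e.g. three models, as with M1/M2/M3), all over the same samples.
    """
    fused = []
    for labels in zip(*predictions):  # one tuple of model labels per sample
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

# Hypothetical outputs of three misogyny-meme classifiers (1 = misogynistic)
m1 = [1, 0, 1, 0]
m2 = [1, 1, 0, 0]
m3 = [0, 1, 1, 0]
print(majority_vote([m1, m2, m3]))  # → [1, 1, 1, 0]
```

With an odd number of binary classifiers there can be no ties, which is one reason three-model ensembles of this kind are a common design choice.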
pdf
bib
abs
KEC_AI_VSS_run2@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Sathiyaseelan S
|
Suresh Babu K
|
Vasikaran S
The increasing instances of abusive language against women on social media platforms have brought to the fore the need for effective content moderation systems, especially in low-resource languages like Tamil and Malayalam. This paper addresses the challenge of detecting gender-based abuse in YouTube comments using annotated datasets in these languages. Comments are classified into abusive and non-abusive categories. We applied the following machine learning algorithms for classification: Random Forest, Support Vector Machine, K-Nearest Neighbor, Gradient Boosting, and AdaBoost. A micro F1 score of 0.95 was achieved by SVM for Tamil and 0.72 by Random Forest for Malayalam. Our system participated in the shared task on abusive comment detection, ranking 13th for Malayalam and 34th for Tamil out of 160 teams, which indicates both the challenges and the potential of our approach in low-resource language processing. Our findings highlight the significance of tailored approaches to language-specific abuse detection.
pdf
bib
abs
The_Deathly_Hallows@DravidianLangTech 2025: AI Content Detection in Dravidian Languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Vasantharan K
|
Prethish G A
|
Vijayakumaran S
The DravidianLangTech@NAACL 2025 shared task focused on Detecting AI-generated Product Reviews in Dravidian Languages, aiming to address the challenge of distinguishing AI-generated content from human-written reviews in Tamil and Malayalam. As AI generated text becomes more prevalent, ensuring the authenticity of online product reviews is crucial for maintaining consumer trust and preventing misinformation. In this study, we explore various feature extraction techniques, including TF-IDF, Count Vectorizer, and transformer-based embeddings such as BERT-Base-Multilingual-Cased and XLM-RoBERTa-Large, to build a robust classification model. Our approach achieved F1-scores of 0.9298 for Tamil and 0.8797 for Malayalam, ranking 8th in Tamil and 11th in Malayalam among all participants. The results highlight the effectiveness of transformer-based embeddings in differentiating AI-generated and human-written content. This research contributes to the growing body of work on AI-generated content detection, particularly in underrepresented Dravidian languages, and provides insights into the challenges unique to these languages.
pdf
bib
abs
SSN_MMHS@DravidianLangTech 2025: A Dual Transformer Approach for Multimodal Hate Speech Detection in Dravidian Languages
Jahnavi Murali
|
Rajalakshmi Sivanaiah
The proliferation of the Internet and social media platforms has resulted in an alarming increase in online hate speech, negatively affecting individuals and communities worldwide. While most research focuses on text-based detection in English, there is an increasing demand for multilingual and multimodal approaches to address hate speech more effectively. This paper presents a methodology for multiclass hate speech classification in low-resource Indian languages namely, Malayalam, Telugu, and Tamil, as part of the shared task at DravidianLangTech 2025. Our proposed approach employs a dual transformer-based framework that integrates audio and text modalities, facilitating cross-modal learning to enhance detection capabilities. Our model achieved macro-F1 scores of 0.348, 0.1631, and 0.1271 in the Malayalam, Telugu, and Tamil subtasks respectively. Although the framework’s performance is modest, it provides valuable insights into the complexities of multimodal hate speech detection in low-resource settings and highlights areas for future improvement, including data augmentation, and alternate fusion and feature extraction techniques.
pdf
bib
abs
InnovateX@DravidianLangTech 2025: Detecting AI-Generated Product Reviews in Dravidian Languages
Moogambigai A
|
Pandiarajan D
|
Bharathi B
This paper presents our approach to the Shared Task on Detecting AI-Generated Product Reviews in Dravidian Languages as part of DravidianLangTech@NAACL 2025. The task focuses on distinguishing between human-written and AI-generated reviews in Tamil and Malayalam, languages rich in linguistic complexities. Using the provided datasets, we implemented machine learning and deep learning models, including Logistic Regression (LR), Support Vector Machine (SVM), and BERT. Through preprocessing techniques like tokenization and TF-IDF vectorization, we achieved competitive results, with our SVM and BERT models demonstrating superior performance in Tamil and Malayalam respectively. Our findings underscore the unique challenges of working with Dravidian languages in this domain and highlight the importance of robust feature extraction.
pdf
bib
abs
KSK@DravidianLangTech 2025: Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments Using Incremental Learning
Kalaivani K S
|
Sanjay R
|
Thissyakkanna S M
|
Nirenjhanram S K
The introduction of Jio in India has significantly increased the number of social media users, particularly on platforms like X (Twitter), Facebook, and Instagram. While this growth is positive, it has also led to a rise in native-language speakers online, making social media analysis more complex. In this study, we focus on Tamil, a Dravidian language, and aim to classify social media comments from X (Twitter) into seven different categories. Tamil-speaking users often communicate using a mix of Tamil and English, creating unique challenges for analysis and tracking. This surge in diverse language usage on social media highlights the need for robust sentiment analysis tools to ensure the platform remains accessible and user-friendly for everyone with different political opinions. In this study, we trained four machine learning models, SGD Classifier, Random Forest Classifier, Decision Tree, and Multinomial Naive Bayes classifier, to identify and classify the comments. Among these, the SGD Classifier achieved the best performance, with a training accuracy of 83.67% and a validation accuracy of 80.43%.
pdf
bib
abs
BlueRay@DravidianLangTech-2025: Fake News Detection in Dravidian Languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Aiswarya M
|
Aruna T
|
Jeevaananth S
The rise of fake news presents significant issues, particularly for underrepresented languages. This study tackles fake news identification in Dravidian languages with two subtasks: binary classification of YouTube comments and multi-class classification of Malayalam news into five groups. The methodology covers text preprocessing, vectorization, and transformer-based embeddings, including baseline comparisons using classic machine learning, deep learning, and transfer learning models. In Task 1, our solution placed 17th, displaying acceptable binary classification performance. In Task 2, we finished in eighth place by effectively identifying nuanced categories of Malayalam news, demonstrating the efficacy of transformer-based models.
pdf
bib
abs
KEC_AI_ZEROWATTS@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Naveenram C E
|
Vishal Rs
|
Srinesh S
Hate speech detection in code-mixed Dravidian languages presents significant challenges due to the multilingual and unstructured nature of the data. In this work, we participated in the shared task to detect hate speech in Tamil, Malayalam, and Telugu using both text and audio data. We explored various machine learning models, including Logistic Regression, Ridge Classifier, Random Forest, and Convolutional Neural Networks (CNN). For Tamil text data, Logistic Regression achieved the highest macro-F1 score of 0.97, while Ridge Classifier performed best for audio with 0.75. In Malayalam, Random Forest excelled for text with 0.97, and CNN for audio with 0.69. For Telugu, Ridge Classifier achieved 0.89 for text and CNN 0.87 for audio. These results demonstrate the efficacy of our multimodal approach in addressing the complexity of hate speech detection across the Dravidian languages. Among 145 teams, we ranked 11th for Tamil, 6th for Malayalam, and 8th for Telugu.
pdf
bib
abs
MNLP@DravidianLangTech 2025: A Deep Multimodal Neural Network for Hate Speech Detection in Dravidian Languages
Shraddha Chauhan
|
Abhinav Kumar
Social media hate speech is a significant issue because it may incite violence, discrimination, and social unrest. Anonymity and reach of such platforms enable the rapid spread of harmful content, targeting individuals or communities based on race, gender, religion, or other attributes. The detection of hate speech is very important for the creation of safe online environments, protection of marginalized groups, and compliance with legal and ethical standards. This paper aims to analyze complex social media content using a combination of textual and audio features. The experimental results establish the effectiveness of the proposed approach, with F1-scores reaching 72% for Tamil, 77% for Malayalam, and 36% for Telugu. Such results strongly indicate that multimodal methodologies have significant room for improvement in hate speech detection in resource-constrained languages and underscore the need to continue further research into this critical area.
pdf
bib
abs
MSM_CUET@DravidianLangTech 2025: XLM-BERT and MuRIL Based Transformer Models for Detection of Abusive Tamil and Malayalam Text Targeting Women on Social Media
Md Mizanur Rahman
|
Srijita Dhar
|
Md Mehedi Hasan
|
Hasan Murad
Social media has evolved into an excellent platform for presenting ideas, viewpoints, and experiences in modern society. But this large domain has also brought some alarming problems, including internet misuse. Abusive language targeting specific groups, such as women, is pervasive on social media. Detecting abusive text is always difficult for low-resource languages like Tamil, Malayalam, and other Dravidian languages, and it is crucial to address this issue seriously. This paper presents a novel approach to detecting abusive Tamil and Malayalam texts targeting women on social media. A shared task on Abusive Tamil and Malayalam Text Targeting Women on Social Media Detection has been organized by DravidianLangTech at NAACL-2025. The organizer has provided an annotated dataset with two labels: Abusive and Non-Abusive. We have experimented with different transformer-based models, including XLM-R, MuRIL, IndicBERT, and mBERT, as well as an ensemble method with SVM and Random Forest. We selected XLM-RoBERTa for Tamil text and MuRIL for Malayalam text due to their superior performance compared to the other models. After developing our model, we tested and evaluated it on the DravidianLangTech@NAACL 2025 shared task dataset. We found that XLM-R provided the best result for abusive Tamil text detection, with an F1 score of 0.7873 on the test set, ranking 2nd among all participants. On the other hand, MuRIL provided the best result for abusive Malayalam text detection, with an F1 score of 0.6812, ranking 10th among all participants.
pdf
bib
abs
MNLP@DravidianLangTech 2025: Transformer-based Multimodal Framework for Misogyny Meme Detection
Shraddha Chauhan
|
Abhinav Kumar
A meme is essentially an artefact of content, usually an amalgamation of picture, text, or video, that spreads like wildfire on the internet, usually shared for amusement, cultural expression, or commentary. Memes are much like an inside joke or a cultural snapshot reflecting shared ideas, emotions, or social commentary, remodulated and reformed by communities. Some of them carry harmful content, such as misogyny. A misogynistic meme is social commentary that espouses negative stereotypes, prejudice, or hatred against women. Detecting and addressing such content will help make the online space inclusive and respectful. This work focuses on developing a multimodal approach for categorizing misogynistic and non-misogynistic memes, using pretrained XLM-RoBERTa to extract text features and a Vision Transformer to extract image features. The combined text and image features are processed by machine learning and deep learning models, which attained F1-scores of 0.77 and 0.88 for Tamil and Malayalam, respectively, on the misogynistic meme dataset.
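Combining text and image features, as several of the multimodal systems above do, is often just feature-level fusion: the two per-sample vectors are concatenated before classification. A minimal stdlib sketch under that assumption; the vectors are hypothetical stand-ins for XLM-RoBERTa and Vision Transformer outputs:

```python
def fuse_features(text_feats, image_feats):
    """Feature-level fusion: concatenate per-sample text and image vectors."""
    assert len(text_feats) == len(image_feats), "one vector pair per sample"
    return [t + v for t, v in zip(text_feats, image_feats)]

# Hypothetical 3-dim text features and 2-dim image features for two memes
text_feats = [[0.1, 0.4, -0.2], [0.0, 0.9, 0.3]]
image_feats = [[0.7, 0.2], [0.5, 0.1]]
fused = fuse_features(text_feats, image_feats)  # each fused vector has 5 dims
```

The fused vectors would then be passed to a downstream classifier; real embeddings are hundreds of dimensions, but the mechanics are identical.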
pdf
bib
abs
Code_Conquerors@DravidianLangTech 2025: Deep Learning Approach for Sentiment Analysis in Tamil and Tulu
Harish Vijay V
|
Ippatapu Venkata Srichandra
|
Pathange Omkareshwara Rao
|
Premjith B
In this paper, we propose a novel approach to sentiment analysis in code-mixed Dravidian languages, specifically Tamil-English and Tulu-English social media text. We introduce an innovative hybrid deep learning architecture that uniquely combines convolutional and recurrent neural networks to effectively capture both local patterns and long-term dependencies in code-mixed text. Our model addresses critical challenges in low-resource language processing through a comprehensive preprocessing pipeline and specialized handling of class imbalance and out-of-vocabulary words. Evaluated on a substantial dataset of social media comments, our approach achieved competitive macro F1 scores of 0.3357 for Tamil (ranked 18th) and 0.3628 for Tulu (ranked 13th).
pdf
bib
abs
KEC_TECH_TITANS@DravidianLangTech 2025: Abusive Text Detection in Tamil and Malayalam Social Media Comments Using Machine Learning
Malliga Subramanian
|
Kogilavani Shanmugavadivel
|
Deepiga P
|
Dharshini S
|
Ananthakumar S
|
Praveenkumar C
Social media platforms have become a breeding ground for hostility and toxicity, with abusive language targeting women becoming a pervasive issue. This paper addresses the detection of abusive content in Tamil and Malayalam social media comments using machine learning models. We experimented with GRU, LSTM, Bidirectional LSTM, CNN, FastText, and XGBoost models, evaluating their performance on a code-mixed dataset of Tamil and Malayalam comments collected from YouTube. Our findings demonstrate that FastText and CNN models yielded the best performance among the evaluated classifiers, achieving F1-scores of 0.73 each. This study contributes to the ongoing research on abusive text detection for under-resourced languages and highlights the need for robust, scalable solutions to combat online toxicity.
pdf
bib
abs
F2 (FutureFiction): Detection of Fake News on Futuristic Technology
Msvpj Sathvik
|
Venkatesh Velugubantla
|
Ravi Teja Potla
Misinformation about futuristic technology and society is widespread. To accurately detect such news, algorithms require up-to-date knowledge. Large Language Models excel at NLP but cannot retrieve ongoing events or innovations; for example, GPT and its variants are restricted to knowledge up to 2021. We introduce a new methodology for the identification of fake news pertaining to futuristic technology and society. Leveraging the power of Google Knowledge, we enhance the capabilities of the GPT-3.5 language model, thereby elevating its performance in the detection of misinformation. The proposed framework exhibits superior efficacy compared to established baselines, with an accuracy of 81.04%. Moreover, we propose a novel dataset of around 21,000 fake news items in three languages, English, Telugu, and Tenglish, drawn from various sources.
pdf
bib
abs
JustATalentedTeam@DravidianLangTech 2025: A Study of ML and DL approaches for Sentiment Analysis in Code-Mixed Tamil and Tulu Texts
Ponsubash Raj R
|
Paruvatha Priya B
|
Bharathi B
The growing prevalence of code-mixed text on social media presents unique challenges for sentiment analysis, particularly in low-resource languages like Tamil and Tulu. This paper explores sentiment classification in Tamil-English and Tulu-English code-mixed datasets using both machine learning (ML) and deep learning (DL) approaches. The ML model utilizes TF-IDF feature extraction combined with a Logistic Regression classifier, while the DL model employs FastText embeddings and a BiLSTM network enhanced with an attention mechanism. Experimental results reveal that the ML model outperforms the DL model in terms of macro F1-score for both languages. Specifically, for Tamil, the ML model achieves a macro F1-score of 0.46, surpassing the DL model’s score of 0.43. For Tulu, the ML model significantly outperforms the DL model, achieving 0.60 compared to 0.48. This performance disparity is more pronounced in Tulu due to its smaller dataset size of 13,308 samples compared to Tamil’s 31,122 samples, highlighting the data efficiency of ML models in low-resource settings. The study provides insights into the strengths and limitations of each approach, demonstrating that traditional ML techniques remain competitive for code-mixed sentiment analysis when data is limited. These findings contribute to ongoing research in multilingual NLP and offer practical implications for applications such as social media monitoring, customer feedback analysis, and conversational AI in Dravidian languages.
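The TF-IDF features underpinning the ML pipeline above weight a token by its frequency in a document, discounted by how many documents contain it. A from-scratch stdlib sketch using the common smoothed IDF variant (in practice one would use a library vectorizer such as scikit-learn's TfidfVectorizer); the toy tokens are hypothetical, not from the shared-task data:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Smoothed TF-IDF weights for a tokenized corpus.

    TF is the within-document relative frequency; IDF is the
    smoothed form log((1 + N) / (1 + df)) + 1.
    """
    n = len(corpus)
    df = Counter()                      # document frequency per token
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return weights

# Toy code-mixed comments; tokens are illustrative only
docs = [["padam", "super", "super"], ["padam", "waste"]]
w = tfidf(docs)
```

Tokens appearing in every document ("padam") end up down-weighted relative to frequent but document-specific ones ("super"), which is exactly the discrimination the Logistic Regression classifier exploits.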
pdf
bib
abs
KEC_TECH_TITANS@DravidianLangTech 2025:Sentiment Analysis for Low-Resource Languages: Insights from Tamil and Tulu using Deep Learning and Machine Learning Models
Malliga Subramanian
|
Kogilavani Shanmugavadivel
|
Dharshini S
|
Deepiga P
|
Praveenkumar C
|
Ananthakumar S
Sentiment analysis in Dravidian languages like Tamil and Tulu presents significant challenges due to their linguistic diversity and limited resources for natural language processing (NLP). This study explores sentiment classification for Tamil and Tulu, focusing on the complexities of handling both languages, which differ in script, grammar, and vocabulary. We employ a variety of machine learning and deep learning techniques, including traditional models like Support Vector Machines (SVM), and K-Nearest Neighbors (KNN), as well as advanced transformer-based models like BERT and multilingual BERT (mBERT). A key focus of this research is to evaluate the performance of these models on sentiment analysis tasks, considering metrics such as accuracy, precision, recall, and F1-score. The results show that transformer-based models, particularly mBERT, significantly outperform traditional machine learning models in both Tamil and Tulu sentiment classification. This study also highlights the need for further research into addressing challenges like language-specific nuances, dataset imbalance, and data augmentation techniques for improved sentiment analysis in under-resourced languages like Tamil and Tulu.
pdf
bib
abs
Code_Conquerors@DravidianLangTech 2025: Multimodal Misogyny Detection in Dravidian Languages Using Vision Transformer and BERT
Pathange Omkareshwara Rao
|
Harish Vijay V
|
Ippatapu Venkata Srichandra
|
Neethu Mohan
|
Sachin Kumar S
This research focuses on misogyny detection in Dravidian languages using multimodal techniques. It leverages advanced machine learning models, including Vision Transformers (ViT) for image analysis and BERT-based transformers for text processing. The study highlights the challenges of working with regional datasets and addresses these with innovative preprocessing and model training strategies. The evaluation reveals significant improvements in detection accuracy, showcasing the potential of multimodal approaches in combating online abuse in underrepresented languages.
pdf
bib
abs
YenLP_CS@DravidianLangTech 2025: Sentiment Analysis on Code-Mixed Tamil-Tulu Data Using Machine Learning and Deep Learning Models
Raksha Adyanthaya
|
Rathnakara Shetty P
Sentiment analysis in code-mixed Dravidian languages such as Tamil-English and Tulu-English is the focus of this study, because these languages present difficulties for conventional techniques. In this work, we used ensembles, multilingual Bidirectional Encoder Representations (mBERT), Bidirectional Long Short-Term Memory (BiLSTM), Random Forest (RF), and Support Vector Machine (SVM) models, with preprocessing in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec feature extraction. mBERT obtained an accuracy of 64% for Tamil and 68% for Tulu on the development datasets. On the test sets, the ensemble model gave Tamil a macro F1-score of 0.4117, while mBERT gave Tulu a macro F1-score of 0.5511. These results demonstrate the approach’s potential, with further advancements possible through regularization and data augmentation.
pdf
bib
abs
LinguAIsts@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Dhanyashree G
|
Kalpana K
|
Lekhashree A
|
Arivuchudar K
|
Arthi R
|
Bommineni Sahitya
|
Pavithra J
|
Sandra Johnson
Social media sites are becoming crucial venues for communication and interaction, yet they are increasingly being utilized to commit gender-based abuse, with horrific, harassing, and degrading comments targeted at women. This paper addresses the common issue of women being subjected to abusive language in two South Indian languages, Malayalam and Tamil. To find explicit abuse, implicit bias, preconceptions, and coded language, we were given a set of YouTube comments labeled Abusive and Non-Abusive. To solve this problem, we applied and compared different machine learning models, i.e., Support Vector Machines (SVM), Logistic Regression (LR), and Naive Bayes classifiers, to classify comments into the given categories. The models were trained and validated on the given dataset to achieve the best performance with respect to accuracy and macro F1 score. The proposed solutions aim to build robust content moderation systems that can detect and prevent abusive language, ensuring safer online environments for women.
pdf
bib
abs
KEC-Elite-Analysts@DravidianLangTech 2025: Deciphering Emotions in Tamil-English and Code-Mixed Social Media Tweets
Malliga Subramanian
|
Aruna A
|
Anbarasan T
|
Amudhavan M
|
Jahaganapathi S
|
Kogilavani Shanmugavadivel
Sentiment analysis in code-mixed languages, particularly Tamil-English, is a growing challenge in natural language processing (NLP) due to the prevalence of multilingual communities on social media. This paper explores various machine learning and transformer-based models, including Logistic Regression, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), BERT, and mBERT, for sentiment classification of Tamil-English code-mixed text. The models are evaluated on a shared task dataset provided by DravidianLangTech@NAACL 2025, with performance measured through accuracy, precision, recall, and F1-score. Our results demonstrate that transformer-based models, particularly mBERT, outperform traditional classifiers in identifying sentiment polarity. Future work aims to address the challenges posed by code-switching and class imbalance through advanced model architectures and data augmentation techniques.
pdf
bib
abs
Cyber Protectors@DravidianLangTech 2025: Abusive Tamil and Malayalam Text Targeting Women on Social Media using FastText
Rohit Vp
|
Madhav M
|
Ippatapu Venkata Srichandra
|
Neethu Mohan
|
Sachin Kumar S
Social media has transformed communication, but it has also opened new avenues for the abuse of women. Because of the complex morphology, large vocabulary, and frequent code-mixing of Tamil and Malayalam, identifying discriminatory text in linguistically diverse settings can be especially challenging. Because traditional moderation systems frequently miss these linguistic subtleties, gendered abuse in many forms, from outright threats to character insults and body shaming, continues. In addition to examining the sociocultural characteristics of this type of harassment on social media, this study compares the effectiveness of several Natural Language Processing (NLP) models, such as FastText, transformer-based architectures, and BiLSTM. Our results show that FastText achieved a macro F1 score of 0.74 on the Tamil dataset and 0.64 on the Malayalam dataset, outperforming the transformer model, which achieved a macro F1 score of 0.62, and BiLSTM, which achieved 0.57. By addressing the limitations of existing moderation techniques, this research underscores the urgent need for language-specific AI solutions to foster safer digital spaces for women.
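One reason FastText holds up well on morphologically rich Tamil and Malayalam is its subword model: each word is represented by its character n-grams (the word wrapped in boundary markers), so unseen inflected forms still share features with seen ones. A stdlib sketch of the n-gram extraction step only; the example word and the 3-4 range are illustrative (FastText defaults to 3-6):

```python
def char_ngrams(word, n_min=3, n_max=4):
    """FastText-style subwords: character n-grams of the word wrapped in < >."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

print(char_ngrams("pen"))  # → ['<pe', 'pen', 'en>', '<pen', 'pen>']
```

A word's vector is then the sum of the vectors of its n-grams (plus the whole word), which is what lets the model generalize across shared stems and suffixes.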
pdf
bib
abs
LinguAIsts@DravidianLangTech 2025: Misogyny Meme Detection using multimodel Approach
Arthi R
|
Pavithra J
|
Dr G Manikandan
|
Lekhashree A
|
Dhanyashree G
|
Bommineni Sahitya
|
Arivuchudar K
|
Kalpana K
Memes often disseminate misogynistic material, which nurtures gender discrimination and stereotyping. While social media is an effective tool of communication, it has also provided a fertile ground for online abuse. The Misogyny Meme Detection Shared Task tackles this vital issue in a multilingual and multimodal setting. Our method employs advanced NLP techniques and machine learning models to classify memes in Malayalam and Tamil, two low-resource languages. Text preprocessing includes tokenization, lemmatization, and stop-word removal, after which features are extracted using TF-IDF. With the best achievable hyperparameters, our SVM-based system provided very promising outcomes, ranking 9th among the competing systems in the Tamil task with an F1-score of 0.71259 and 15th in the Malayalam task with an F1-score of 0.68186. This work demonstrates how important AI-based solutions are for stopping online harassment and developing secure online spaces.
pdf
bib
abs
CUET_Agile@DravidianLangTech 2025: Fine-tuning Transformers for Detecting Abusive Text Targeting Women from Tamil and Malayalam Texts
Tareque Md Hanif
|
Md Rashadur Rahman
As social media has grown, so has online abuse, with women often facing harmful online behavior. This discourages their free participation and expression online. This paper outlines the approach adopted by our team for detecting abusive comments in Tamil and Malayalam. The task focuses on classifying whether a given comment contains abusive language towards women. We experimented with transformer-based models by fine-tuning Tamil-BERT for Tamil and Malayalam-BERT for Malayalam. Additionally, we fine-tuned IndicBERT v2 on both the Tamil and Malayalam datasets. To evaluate the effect of pre-processing, we also conducted experiments using non-preprocessed text. Results demonstrate that IndicBERT v2 outperformed the language-specific BERT models in both languages. Pre-processing the data showed mixed results, with a slight improvement on the Tamil dataset but no significant benefit for the Malayalam dataset. Our approach secured first place in Tamil with a macro F1-score of 0.7883 and second place in Malayalam with a macro F1-score of 0.7234. The implementation details of the task can be found in the GitHub repository.
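The macro F1-score used to rank this and most of the shared tasks above is the unweighted mean of per-class F1, so minority classes count as much as majority ones. A from-scratch stdlib sketch (equivalent to scikit-learn's `f1_score(..., average="macro")`); the label lists are hypothetical:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical abusive (1) vs non-abusive (0) predictions
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(macro_f1(y_true, y_pred))  # → 0.8
```

Because each class contributes equally, a system that ignores the rarer abusive class is penalized far more under macro F1 than under accuracy, which is why the metric is preferred for imbalanced abuse-detection data.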
pdf
bib
abs
Necto@DravidianLangTech 2025: Fine-tuning Multilingual MiniLM for Text Classification in Dravidian Languages
Livin Nector Dhasan
This paper explores the application of a fine-tuned Multilingual MiniLM model for various binary text classification tasks, including AI-generated product review detection, detection of abusive language targeting women, and fake news detection in the Dravidian languages Tamil and Malayalam. This work was done as part of submissions to shared tasks organized by DravidianLangTech@NAACL 2025. The model was fine-tuned using both Tamil and Malayalam datasets, and its performance was evaluated across different tasks using the macro F1-score. The results indicate that the model achieves performance very close to the best F1 scores reported by other teams. An investigation is conducted on the AI-generated product review dataset and the findings are reported.
pdf
bib
abs
CUET-823@DravidianLangTech 2025: Shared Task on Multimodal Misogyny Meme Detection in Tamil Language
Arpita Mallik
|
Ratnajit Dhar
|
Udoy Das
|
Momtazul Arefin Labib
|
Samia Rahman
|
Hasan Murad
Misogynous content on social media, especially in memes, presents challenges due to the complex interplay of text and images that carry offensive messages. This difficulty mostly arises from the lack of direct alignment between modalities and from biases in large-scale visio-linguistic models. In this paper, we present our system for the Shared Task on Misogyny Meme Detection - DravidianLangTech@NAACL 2025. We have implemented various unimodal models, such as mBERT and IndicBERT for text data, and ViT, ResNet, and EfficientNet for image data. Moreover, we have tried combining these models and finally adopted a multimodal approach that combines mBERT for text and EfficientNet for image features, both fine-tuned to better interpret subtle language and detailed visuals. The fused features are processed through a dense neural network for classification. Our approach achieved an F1 score of 0.78120, securing 4th place and demonstrating the potential of transformer-based architectures and state-of-the-art CNNs for this task.
pdf
bib
abs
Hermes@DravidianLangTech 2025: Sentiment Analysis of Dravidian Languages using XLM-RoBERTa
Emmanuel George P
|
Ashiq Firoz
|
Madhav Murali
|
Siranjeevi Rajamanickam
|
Balasubramanian Palani
Sentiment analysis, the task of identifying subjective opinions or emotional responses, has become increasingly significant with the rise of social media. However, analysing sentiment in Dravidian languages such as Tamil-English and Tulu-English presents unique challenges due to linguistic code-switching (where people tend to mix multiple languages) and non-native scripts. Traditional monolingual sentiment analysis models struggle to address these complexities effectively. This research explores a fine-tuned transformer model based on XLM-RoBERTa for sentiment detection, using the XLM-RoBERTa tokenizer for text preprocessing. Additionally, the performance of the XLM-RoBERTa model was compared with traditional machine learning models such as Logistic Regression (LR) and Random Forest (RF), as well as other transformer-based models like BERT and RoBERTa. This research is based on our work for the Sentiment Analysis in Tamil and Tulu shared task at DravidianLangTech@NAACL 2025, where we achieved a macro F1-score of 59% for the Tulu dataset and 49% for the Tamil dataset, placing third in the competition.
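For context, the macro F1-score used to rank systems throughout these shared tasks is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. A minimal sketch with scikit-learn (toy labels, not competition data):

```python
# Macro F1 = unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Per-class F1 here: class 0 -> 0.5, class 1 -> 0.8, class 2 -> 2/3
macro = f1_score(y_true, y_pred, average="macro")
print(round(macro, 4))  # (0.5 + 0.8 + 2/3) / 3 ~= 0.6556
```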
pdf
bib
abs
SSNTrio@DravidianLangTech 2025: Identification of AI Generated Content in Dravidian Languages using Transformers
J Bhuvana
|
Mirnalinee T T
|
Rohan R
|
Diya Seshan
|
Avaneesh Koushik
The increasing prevalence of AI-generated content has raised concerns about the authenticity and reliability of online reviews, particularly in resource-limited languages like Tamil and Malayalam. This paper presents an approach to the Shared Task on Detecting AI-generated Product Reviews in Dravidian Languages at NAACL 2025, which focuses on distinguishing AI-generated reviews from human-written ones in Tamil and Malayalam. Several transformer-based models, including IndicBERT, RoBERTa, mBERT, and XLM-R, were evaluated, with language-specific BERT models for Tamil and Malayalam demonstrating the best performance. The chosen methodologies were evaluated using the Macro Average F1 score. In the rank list released by the organizers, team SSNTrio achieved ranks of 3rd and 29th for the Malayalam and Tamil datasets, with Macro Average F1 Scores of 0.914 and 0.598 respectively.
pdf
bib
abs
SSNTrio@DravidianLangTech 2025: Sentiment Analysis in Dravidian Languages using Multilingual BERT
J Bhuvana
|
Mirnalinee T T
|
Diya Seshan
|
Rohan R
|
Avaneesh Koushik
This paper presents an approach to sentiment analysis for code-mixed Tamil-English and Tulu-English datasets as part of the DravidianLangTech@NAACL 2025 shared task. Sentiment analysis, the process of determining the emotional tone or subjective opinion in text, has become a critical tool in analyzing public sentiment on social media platforms. The approach discussed here uses multilingual BERT (mBERT) fine-tuned on the provided datasets to classify sentiment polarity into various predefined categories: for Tulu, the categories were positive, negative, not_tulu, mixed, and neutral; for Tamil, the categories were positive, negative, unknown, mixed_feelings, and neutral. The mBERT model demonstrates its effectiveness in handling sentiment analysis for code-mixed and resource-constrained languages by achieving an F1-score of 0.44 for Tamil, securing the 6th position in the rank list, and 0.56 for Tulu, ranking 5th in the respective task.
pdf
bib
abs
NLP_goats@DravidianLangTech 2025: Detecting Fake News in Dravidian Languages: A Text Classification Approach
Srihari V K
|
Vijay Karthick Vaidyanathan
|
Thenmozhi Durairaj
The advent and expansion of social media have transformed global communication. Despite its numerous advantages, it has also created an avenue for the rapid spread of fake news, which can impact people’s decision-making and judgment. This study explores detecting fake news as part of the DravidianLangTech@NAACL 2025 shared task, focusing on two key tasks. The aim of Task 1 is to classify Malayalam social media posts as either original or fake, and Task 2 categorizes Malayalam-language news articles into five levels of truthfulness: False, Half True, Mostly False, Partly False, and Mostly True. We accomplished the tasks using transformer models, e.g., M-BERT, and classifiers like Naive Bayes. Our results were promising, with M-BERT achieving the best results. We achieved a macro-F1 score of 0.83 for distinguishing between fake and original content in Task 1 and a score of 0.54 for classifying news articles in Task 2, ranking us 11th and 4th, respectively.
pdf
bib
abs
NLP_goats@DravidianLangTech 2025: Towards Safer Social Media: Detecting Abusive Language Directed at Women in Dravidian Languages
Vijay Karthick Vaidyanathan
|
Srihari V K
|
Thenmozhi Durairaj
Social media in the present world is an essential communication platform for information sharing. But its emergence has also led to an increase in the proportion of online abuse, in particular against women in the form of abusive and offensive messages. Such abuse reflects social inequalities and has a profound psychological and social impact on its victims, which underscores the importance of detecting abusive language. This work, part of DravidianLangTech@NAACL 2025, aims at developing an automated system for detecting abusive content directed towards women in Tamil and Malayalam, two Dravidian languages. Using a dataset of YouTube comments on sensitive issues, the study uses multilingual BERT (mBERT) to distinguish abusive comments from non-abusive ones. We achieved F1 scores of 0.75 in Tamil and 0.68 in Malayalam, placing us 13th and 9th respectively.
pdf
bib
abs
HerWILL@DravidianLangTech 2025: Ensemble Approach for Misogyny Detection in Memes Using Pre-trained Text and Vision Transformers
Neelima Monjusha Preeti
|
Trina Chakraborty
|
Noor Mairukh Khan Arnob
|
Saiyara Mahmud
|
Azmine Toushik Wasi
Misogynistic memes on social media perpetuate gender stereotypes, contribute to harassment, and suppress feminist activism. However, most existing misogyny detection models focus on high-resource languages, leaving a gap in low-resource settings. This work addresses that gap by focusing on misogynistic memes in Tamil and Malayalam, two Dravidian languages with limited resources. We combine computer vision and natural language processing for multi-modal detection, using CLIP embeddings for the vision component and BERT models trained on code-mixed hate speech datasets for the text component. Our results show that this integrated approach effectively captures the unique characteristics of misogynistic memes in these languages, achieving competitive performance with a Macro F1 Score of 0.7800 for the Tamil test set and 0.8748 for the Malayalam test set. These findings highlight the potential of multimodal models and the adaptation of pre-trained models to specific linguistic and cultural contexts, advancing misogyny detection in low-resource settings. Code available at https://github.com/HerWILL-Inc/NAACL-2025
pdf
bib
abs
Cognitext@DravidianLangTech2025: Fake News Classification in Malayalam Using mBERT and LSTM
Shriya Alladi
|
Bharathi B
Fake news detection is a crucial task in combating misinformation, particularly in underrepresented languages such as Malayalam. This paper focuses on detecting fake news in Dravidian languages using two tasks: Social Media Text Classification and News Classification. We employ a fine-tuned multilingual BERT (mBERT) model for classifying a given social media text as original or fake and an LSTM-based architecture for accurately detecting and classifying fake news articles in the Malayalam language into different categories. Extensive preprocessing techniques, such as tokenization and text cleaning, were used to ensure data quality. Our experiments achieved significant accuracy rates and F1-scores. The study’s contributions include applying advanced machine learning techniques to the Malayalam language, addressing the lack of research on low-resource languages, and highlighting the challenges of fake news detection in multilingual and code-mixed environments.
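As an illustration of the kind of preprocessing this abstract mentions (the exact cleaning steps the authors used are not specified, so this is an assumed, simplified version), a minimal tokenization-and-cleaning pass might look like:

```python
# Assumed, simplified text-cleaning sketch: lowercase, strip URLs and
# punctuation, then tokenize on whitespace.
import re

def clean(text):
    """Lowercase, remove URLs and punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)       # remove punctuation
    return text.split()

print(clean("Fake NEWS!! http://example.com spreads fast..."))
# -> ['fake', 'news', 'spreads', 'fast']
```

For Malayalam text, `\w` with Python's default Unicode matching preserves Malayalam-script characters, so the same pass works on code-mixed input.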
pdf
bib
abs
NLP_goats@DravidianLangTech 2025: Detecting AI-Written Reviews for Consumer Trust
Srihari V K
|
Vijay Karthick Vaidyanathan
|
Mugilkrishna D U
|
Thenmozhi Durairaj
The rise of AI-generated content has introduced challenges in distinguishing machine-generated text from human-written text, particularly in low-resource languages. The identification of artificial intelligence (AI)-based reviews is of significant importance to preserve trust and authenticity on online platforms. The Shared Task on Detecting AI-Generated Product Reviews in Dravidian Languages addresses the detection of AI-generated and human-written reviews in Tamil and Malayalam. To solve this problem, we fine-tuned mBERT for binary classification. Our system achieved 10th place in Tamil with a macro F1-score of 0.90 and 28th place in Malayalam with a macro F1-score of 0.68, as reported by the NAACL 2025 organizers. The findings demonstrate the complexity involved in separating AI-derived text from human-authored writing and call for continued advances in detection methods.
pdf
bib
abs
RATHAN@DravidianLangTech 2025: Annaparavai - Separate the Authentic Human Reviews from AI-generated one
Jubeerathan Thevakumar
|
Luheerathan Thevakumar
Detecting AI-generated reviews is crucial for maintaining the authenticity of online feedback in low-resource languages like Tamil and Malayalam. We propose a transfer learning-based approach using embeddings from XLM-RoBERTa, IndicBERT, mT5, and Sentence-BERT, validated with five-fold cross-validation via XGBoost. These embeddings are used to train deep neural networks (DNNs), refined through a weighted ensemble model. Our method achieves 90% F1-score for Malayalam and 73% for Tamil, demonstrating the effectiveness of transfer learning and ensembling for review detection. The source code is publicly available to support further research and improve online review systems in multilingual settings.
pdf
bib
abs
DLRG@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Ratnavel Rajalakshmi
|
Ramesh Kannan
|
Meetesh Saini
|
Bitan Mallik
Social media is a powerful communication tool and rich in diverse content, requiring innovative approaches to understand nuances of the languages. Addressing challenges like hate speech necessitates multimodal analysis that integrates textual and other cues to capture its context and intent effectively. This paper proposes a multimodal hate speech detection system in Tamil, which uses textual and audio features for classification. Our proposed system uses a fine-tuned Indic-BERT model for text-based hate speech detection and a Wav2Vec2 model for audio-based hate speech detection. The fine-tuned Indic-BERT model with Whisper achieved an F1 score of 0.25 on the multimodal approach. Our proposed approach ranked 10th in the shared task on Multimodal Hate Speech Detection in Dravidian languages at the NAACL 2025 Workshop DravidianLangTech.
pdf
bib
abs
Team ML_Forge@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Adnan Faisal
|
Shiti Chowdhury
|
Sajib Bhattacharjee
|
Udoy Das
|
Samia Rahman
|
Momtazul Arefin Labib
|
Hasan Murad
Ensuring a safe and inclusive online environment requires effective hate speech detection on social media. While detection systems have significantly advanced for English, many regional languages, including Malayalam, Tamil and Telugu, remain underrepresented, creating challenges in identifying harmful content accurately. These languages present unique challenges due to their complex grammar, diverse dialects, and frequent code-mixing with English. The rise of multimodal content, including text and audio, adds further complexity to detection tasks. The shared task “Multimodal Hate Speech Detection in Dravidian Languages: DravidianLangTech@NAACL 2025” has aimed to address these challenges. A YouTube-sourced dataset has been provided, labeled into five categories: Gender (G), Political (P), Religious (R), Personal Defamation (C) and Non-Hate (NH). In our approach, we have used mBERT and T5 for text, and Wav2Vec2 and Whisper for audio. T5 has performed poorly compared to mBERT, which has achieved the highest F1 scores on the test dataset. For audio, Wav2Vec2 has been chosen over Whisper because it processes raw audio effectively using self-supervised learning. In the hate speech detection task, we have achieved a macro F1 score of 0.2005 for Malayalam, ranking 15th, 0.1356 for Tamil and 0.1465 for Telugu, with both ranking 16th.
pdf
bib
abs
codecrackers@DravidianLangTech 2025: Sentiment Classification in Tamil and Tulu Code-Mixed Social Media Text Using Machine Learning
Lalith Kishore V P
|
Dr G Manikandan
|
Mohan Raj M A
|
Keerthi Vasan A
|
Aravindh M
Sentiment analysis of code-mixed Dravidian languages has become a major research focus with increasing volumes of multilingual and code-mixed content across social media. This paper presents our work for the “Seventh Shared Task on Sentiment Analysis in Code-mixed Tamil and Tulu”, held as part of DravidianLangTech (NAACL-2025). Sentiment analysis for code-mixed Dravidian languages has received little attention due to challenges such as class imbalance, small sample size, and the informal nature of code-mixed text. This study applied an SVM-based approach for the sentiment classification of both Tamil and Tulu. The SVM model achieved competitive macro-average F1 scores of 0.54 for Tulu and 0.438 for Tamil, demonstrating that traditional machine learning methods can effectively tackle sentiment categorization in code-mixed languages under low-resource settings.
pdf
bib
abs
CUET_Ignite@DravidianLangTech 2025: Detection of Abusive Comments in Tamil Text Using Transformer Models
MD.Mahadi Rahman
|
Mohammad Minhaj Uddin
|
Mohammad Shamsul Arefin
Abusive comment detection in low-resource languages is a challenging task, particularly when addressing gender-based abuse. Identifying abusive language targeting women is crucial for effective content moderation and fostering safer online spaces. A shared task on abusive comment detection in Tamil text organized by DravidianLangTech@NAACL 2025 allowed us to address this challenge using a curated dataset. For this task, we experimented with various machine learning (ML) and deep learning (DL) models, including Logistic Regression, Random Forest, SVM, CNN, LSTM, and BiLSTM, and transformer-based models such as mBERT, IndicBERT, and XLM-RoBERTa, among others. The dataset comprised Tamil YouTube comments annotated with binary labels, Abusive and Non-Abusive, capturing explicit abuse, implicit biases, and stereotypes. Our experiments demonstrated that XLM-RoBERTa achieved the highest macro F1-score (0.80), highlighting its effectiveness in handling Tamil text. This research contributes to advancing abusive language detection and natural language processing in low-resource languages, particularly for addressing gender-based abuse online.
pdf
bib
abs
CUET_Absolute_Zero@DravidianLangTech 2025: Detecting AI-Generated Product Reviews in Malayalam and Tamil Language Using Transformer Models
Anindo Barua
|
Sidratul Muntaha
|
Momtazul Arefin Labib
|
Samia Rahman
|
Udoy Das
|
Hasan Murad
Artificial Intelligence (AI) is opening new doors of learning and interaction. However, it has its share of problems. One major issue is the ability of AI to generate text that resembles human-written text. So, how can we tell human-written text apart from AI-generated text? With this in mind, we have worked on detecting AI-generated product reviews in Dravidian languages, mainly in Malayalam and Tamil. The “Shared Task on Detecting AI-Generated Product Reviews in Dravidian Languages,” held as part of the DravidianLangTech Workshop at NAACL 2025, has provided a dataset categorized into two categories: human-written review and AI-generated review. We have implemented four machine learning models (Random Forest, Support Vector Machine, Decision Tree, and XGBoost), four deep learning models (Long Short-Term Memory, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, and Recurrent Neural Network), and three transformer-based models (AI-Human-Detector, Detect-AI-Text, and E5-Small-Lora-AI-Generated-Detector). We have conducted a comparative study among all the models by training and evaluating each model on the dataset. We have discovered that the transformer E5-Small-Lora-AI-Generated-Detector provided the best result, with an F1 score of 0.8994 on the test set, ranking 7th in the Malayalam language. Tamil has a higher token overlap and richer morphology than Malayalam; thus, we obtained a worse F1 score of 0.5877, ranking 28th among all participants in the Tamil task.
pdf
bib
abs
MNLP@DravidianLangTech 2025: Transformers vs. Traditional Machine Learning: Analyzing Sentiment in Tamil Social Media Posts
Abhay Vishwakarma
|
Abhinav Kumar
Sentiment analysis in Natural Language Processing (NLP) aims to categorize opinions in text. In the political domain, understanding public sentiment is crucial for influencing policymaking. Social media platforms like X (Twitter) provide abundant sources of real-time political discourse. This study focuses on political multiclass sentiment analysis of Tamil comments from X, classifying sentiments into seven categories: substantiated, sarcastic, opinionated, positive, negative, neutral, and none of the above. A number of traditional machine learning models, such as Naive Bayes and a Voting Classifier (an ensemble of Decision Tree, SVM, Naive Bayes, K-Nearest Neighbors, and Logistic Regression), and deep learning models, such as LSTM, deBERTa, and a hybrid approach combining deBERTa embeddings with an LSTM layer, are implemented. The proposed ensemble-based voting classifier achieved the best performance among all implemented models, with an accuracy of 0.3750, precision of 0.3387, recall of 0.3250, and macro F1-score of 0.3227.
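The voting ensemble described above can be sketched roughly as follows (toy numeric features stand in for the Tamil text features; the hyperparameters are placeholders, not the authors' settings):

```python
# Illustrative hard-voting ensemble over the five base models named above.
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy, linearly separable 2-D data in place of real text features.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

ensemble = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier()),
        ("svm", SVC()),
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("lr", LogisticRegression()),
    ],
    voting="hard",  # majority vote over the five base predictions
)
ensemble.fit(X, y)
preds = ensemble.predict([[0, 0], [2, 3]])
print(list(preds))
```

With `voting="hard"`, each base model casts one vote per sample and the majority label wins, which is what makes the ensemble robust to any single weak learner.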
pdf
bib
abs
shimig@DravidianLangTech2025: Stratification of Abusive content on Women in Social Media
Gersome Shimi
|
Jerin Mahibha C
|
Thenmozhi Durairaj
The social network is a trending medium for interaction and sharing content globally. The content is sensitive, since it can create an impact and change the trends of stakeholders’ thoughts as well as behavior. When content is targeted towards women, it may be abusive or non-abusive, and its identification is a tedious task. The content posted on social networks can be in English, code-mixed, or any low-resource language. The shared task Abusive Tamil and Malayalam Text targeting Women on Social Media was conducted as part of DravidianLangTech@NAACL 2025, organized by DravidianLangTech. The task is to identify whether content given in Tamil, Malayalam, or code-mixed form is abusive or non-abusive. The task was accomplished for the South Indian languages Tamil and Malayalam using the pretrained transformer model BERT base multilingual cased, achieving accuracy measures of 0.765 and 0.677 respectively.
pdf
bib
abs
SSNTrio@DravidianLangTech2025: LLM Based Techniques for Detection of Abusive Text Targeting Women
Mirnalinee T T
|
J Bhuvana
|
Avaneesh Koushik
|
Diya Seshan
|
Rohan R
This study focuses on developing a solution for detecting abusive texts on social media against women in Tamil and Malayalam, two low-resource Dravidian languages in South India. As the usage of social media for communication and idea sharing has increased significantly, these platforms are being used to target and victimize women. Hence an automated solution becomes necessary to screen the huge volume of content generated. This work is part of the Shared Task on Abusive Tamil and Malayalam Text targeting Women on Social Media - DravidianLangTech@NAACL 2025. The approach used to tackle this problem involves utilizing LLM-based techniques for classifying abusive text. The Macro Average F1-Score for the Tamil BERT model was 0.76, securing the 11th position, while the Malayalam BERT model obtained a score of 0.30 and secured the 33rd rank. The proposed solution can be extended further to incorporate other regional languages based on similar techniques.
pdf
bib
abs
CUET-NLP_MP@DravidianLangTech 2025: A Transformer and LLM-Based Ensemble Approach for Fake News Detection in Dravidian
Md Minhazul Kabir
|
Md. Mohiuddin
|
Kawsar Ahmed
|
Mohammed Moshiul Hoque
Fake news detection is a critical problem in today’s digital age, aiming to classify intentionally misleading or fabricated news content. In this study, we present a transformer and LLM-based ensemble method to address the challenges in fake news detection. We explored various machine learning (ML), deep learning (DL), transformer, and LLM-based approaches on a Malayalam fake news detection dataset. Our findings highlight the difficulties faced by traditional ML and DL methods in accurately detecting fake news, while transformer- and LLM-based ensemble methods demonstrate significant improvements in performance. The ensemble method combining Sarvam-1, Malayalam-BERT, and XLM-R outperformed all other approaches, achieving an F1-score of 89.30% on the given dataset. This accomplishment, which contributed to securing 2nd place in the shared task at DravidianLangTech 2025, underscores the importance of developing effective methods for detecting fake news in Dravidian languages.
pdf
bib
abs
CUET-NLP_Big_O@DravidianLangTech 2025: A Multimodal Fusion-based Approach for Identifying Misogyny Memes
Md. Refaj Hossan
|
Nazmus Sakib
|
Md. Alam Miah
|
Jawad Hossain
|
Mohammed Moshiul Hoque
Memes have become one of the main mediums for expressing ideas, humor, and opinions through visual-textual content on social media. The same medium has been used to propagate harmful ideologies, such as misogyny, that undermine gender equality and perpetuate harmful stereotypes. Identifying misogynistic memes is particularly challenging in low-resource languages (LRLs), such as Tamil and Malayalam, due to the scarcity of annotated datasets and sophisticated tools. Therefore, DravidianLangTech@NAACL 2025 launched a Shared Task on Misogyny Meme Detection. For this task, this work explored an extensive array of models: machine learning models (LR, RF, SVM, and XGBoost) and deep learning models (CNN, BiLSTM+CNN, CNN+GRU, and LSTM) for textual features, while CNN, BiLSTM+CNN, ResNet50, and DenseNet121 were utilized for visual features. Furthermore, we explored feature-level and decision-level fusion techniques with several model combinations, such as MuRIL with ResNet50, MuRIL with BiLSTM+CNN, T5+MuRIL with ResNet50, and mBERT with ResNet50. The evaluation results demonstrated that BERT + ResNet50 performed best, obtaining an F1 score of 0.81716 (Tamil), and was ranked 2nd in the task. The early fusion of MuRIL+ResNet50 showed the highest F1 score of 0.82531 in Malayalam and received a 9th rank.
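The feature-level (early) fusion mentioned above amounts to concatenating the text and image embeddings before classification. A minimal sketch, assuming the two encoders have already produced fixed-size embeddings (random vectors stand in for the MuRIL/ResNet50 features here, and the classifier is a stand-in for the dense fusion head):

```python
# Illustrative early fusion: concatenate modality embeddings, then classify.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 40
text_emb = rng.normal(size=(n, 8))   # stand-in for text-encoder embeddings
img_emb = rng.normal(size=(n, 8))    # stand-in for image-encoder embeddings
labels = (text_emb[:, 0] + img_emb[:, 0] > 0).astype(int)  # synthetic labels

# Early fusion: one joint feature vector per meme.
fused = np.concatenate([text_emb, img_emb], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.score(fused, labels))  # training accuracy on the toy data
```

Decision-level (late) fusion would instead train one classifier per modality and combine their predicted probabilities or votes.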
pdf
bib
abs
LexiLogic@DravidianLangTech 2025: Detecting Misogynistic Memes and Abusive Tamil and Malayalam Text Targeting Women on Social Media
Niranjan Kumar M
|
Pranav Gupta
|
Billodal Roy
|
Souvik Bhattacharyya
Social media platforms have become a significant medium for communication and expression, but they are also plagued by misogynistic content targeting women. This study focuses on detecting misogyny in memes and abusive textual content in Tamil and Malayalam languages, which are underrepresented in natural language processing research. Leveraging advanced machine learning and deep learning techniques, we developed a system capable of identifying misogynistic memes and abusive text. By addressing cultural and linguistic nuances, our approach enhances detection accuracy and contributes to safer online spaces for women. This work also serves as a foundation for expanding misogyny detection to other low-resource languages, fostering inclusivity and combating online abuse effectively. This paper presents our work on detecting misogynistic memes and abusive Tamil and Malayalam text targeting women on social media platforms. Leveraging the pretrained models l3cube-pune/tamil-bert and l3cube-pune/malayalam-bert, we explored various data cleaning and augmentation strategies to enhance detection performance. The models were fine-tuned on curated datasets and evaluated using accuracy, F1-score, precision, and recall. The results demonstrated significant improvements with our cleaning and augmentation techniques, yielding robust performance in detecting nuanced and culturally-specific abusive content. Our model achieved macro F1 scores of 77.83/78.24 on L3Cube-Bert-Tamil and 78.16/77.01 on L3Cube-Bert-Malayalam, ranking 3rd and 4th on the leaderboard. For the misogyny task, we obtained 83.58/82.94 on L3Cube-Bert-Malayalam and 73.16/73.8 on L3Cube-Bert-Tamil, placing 9th in both. These results highlight our model’s effectiveness in low-resource language classification.
pdf
bib
abs
CUET-NLP_Big_O@DravidianLangTech 2025: A BERT-based Approach to Detect Fake News from Malayalam Social Media Texts
Nazmus Sakib
|
Md. Refaj Hossan
|
Alamgir Hossain
|
Jawad Hossain
|
Mohammed Moshiul Hoque
The rapid growth of digital platforms and social media has significantly contributed to spreading fake news, posing serious societal challenges. While extensive research has been conducted on detecting fake news in high-resource languages (HRLs) such as English, relatively little attention has been given to low-resource languages (LRLs) like Malayalam due to insufficient data and computational tools. To address this challenge, the DravidianLangTech 2025 workshop organized a shared task on fake news detection in Dravidian languages. The task was divided into two sub-tasks, and our team participated in Task 1, which focused on classifying social media texts as original or fake. We explored a range of machine learning (ML) techniques, including Logistic Regression (LR), Multinomial Naïve Bayes (MNB), and Support Vector Machines (SVM), as well as deep learning (DL) models such as CNN, BiLSTM, and a hybrid CNN+BiLSTM. Additionally, this work examined several transformer-based models, including m-BERT, Indic-BERT, XLM-Roberta, and MuRIL-BERT, to exploit the task. Our team achieved 6th place in Task 1, with MuRIL-BERT delivering the best performance, achieving an F1 score of 0.874.
pdf
bib
abs
LexiLogic@DravidianLangTech 2025: Detecting Fake News in Malayalam and AI-Generated Product Reviews in Tamil and Malayalam
Souvik Bhattacharyya
|
Pranav Gupta
|
Niranjan Kumar M
|
Billodal Roy
Fake news and hard-to-detect AI-generated content are pressing issues in online media, and they are expected to worsen due to the recent advances in generative AI. Moreover, tools to keep such content under check are less accurate for languages with less available online data. In this paper, we describe our submissions to two shared tasks at the NAACL Dravidian Language Tech workshop, namely detecting fake news in Malayalam and detecting AI-generated product reviews in Malayalam and Tamil. We obtained test macro F1 scores of 0.29 and 0.82 in the multi-class and binary classification sub-tasks within the Malayalam fake news task, and test macro F1 scores of 0.9 and 0.646 in the task of detecting AI-generated product reviews in Malayalam and Tamil respectively.
pdf
bib
abs
SSNTrio @ DravidianLangTech 2025: Hybrid Approach for Hate Speech Detection in Dravidian Languages with Text and Audio Modalities
J Bhuvana
|
Mirnalinee T T
|
Rohan R
|
Diya Seshan
|
Avaneesh Koushik
This paper presents the approach and findings from the Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL) shared task at DravidianLangTech@NAACL 2025. The task focuses on detecting multimodal hate speech in Tamil, Malayalam, and Telugu, requiring models to analyze both text and speech components from social media content. The proposed methodology uses language-specific BERT models for the provided text transcripts, followed by multimodal feature extraction techniques, and classification using a Random Forest classifier to enhance performance across the three languages. The models achieved a macro-F1 score of 0.7332 (Rank 1) in Tamil, 0.7511 (Rank 1) in Malayalam, and 0.3758 (Rank 2) in Telugu, demonstrating the effectiveness of the approach in multilingual settings. The models performed well despite the challenges posed by limited resources, highlighting the potential of language-specific BERT models and multimodal techniques in hate speech detection for Dravidian languages.
pdf
bib
abs
Fired_from_NLP@DravidianLangTech 2025: A Multimodal Approach for Detecting Misogynistic Content in Tamil and Malayalam Memes
Md. Sajid Alam Chowdhury
|
Mostak Mahmud Chowdhury
|
Anik Mahmud Shanto
|
Jidan Al Abrar
|
Hasan Murad
In the context of online platforms, identifying misogynistic content in memes is crucial for maintaining a safe and respectful environment. While most research has focused on high-resource languages, there is limited work on languages like Tamil and Malayalam. To address this gap, we have participated in the Misogyny Meme Detection task organized by DravidianLangTech@NAACL 2025, utilizing the provided dataset named MDMD (Misogyny Detection Meme Dataset), which consists of Tamil and Malayalam memes. In this paper, we have proposed a multimodal approach combining visual and textual features to detect misogynistic content. Through a comparative analysis of different model configurations, combining various deep learning-based CNN architectures and transformer-based models, we have developed fine-tuned multimodal models that effectively identify misogynistic memes in Tamil and Malayalam. We have achieved an F1 score of 0.678 for Tamil memes and 0.803 for Malayalam memes.
pdf
bib
abs
One_by_zero@DravidianLangTech 2025: Fake News Detection in Malayalam Language Leveraging Transformer-based Approach
Dola Chakraborty
|
Shamima Afroz
|
Jawad Hossain
|
Mohammed Moshiul Hoque
The rapid spread of misinformation in the digital era presents critical challenges for fake news detection, especially in low-resource languages (LRLs) like Malayalam, which lack the extensive datasets and pre-trained models available for widely spoken languages. This gap in resources makes it harder to build robust systems for combating misinformation, despite the significant societal and political consequences it can have. To address these challenges, this work proposes a transformer-based approach for Task 1 of Fake News Detection in Dravidian Languages (DravidianLangTech@NAACL 2025), which focuses on classifying Malayalam social media texts as either original or fake. The experiments involved a range of ML techniques (Logistic Regression (LR), Support Vector Machines (SVM), and Decision Trees (DT)) and DL architectures (BiLSTM, BiLSTM-LSTM, and BiLSTM-CNN). This work also explored transformer-based models, including IndicBERT, MuRiL, XLM-RoBERTa, and Malayalam BERT. Among these, Malayalam BERT achieved the best performance with the highest macro F1-score of 0.892, securing 3rd place in the competition.
pdf
bib
abs
CUET_Novice@DravidianLangTech 2025: A Multimodal Transformer-Based Approach for Detecting Misogynistic Memes in Malayalam Language
Khadiza Sultana Sayma
|
Farjana Alam Tofa
|
Md Osama
|
Ashim Dey
Memes, combining images and text, are a popular social media medium that can spread humor or harmful content, including misogyny—hatred or discrimination against women. Detecting misogynistic memes in Malayalam is challenging due to their multimodal nature, requiring analysis of both visual and textual elements. A Shared Task on Misogyny Meme Detection, organized as part of DravidianLangTech@NAACL 2025, aimed to address this issue by promoting the advancement of multimodal machine learning models for classifying Malayalam memes as misogynistic or non-misogynistic. In this work, we explored visual, textual, and multimodal approaches for meme classification. CNN, ResNet50, Vision Transformer (ViT), and Swin Transformer were used for visual feature extraction, while mBERT, IndicBERT, and MalayalamBERT were employed for textual analysis. Additionally, we experimented with multimodal fusion models, including IndicBERT+ViT, MalayalamBERT+ViT, and MalayalamBERT+Swin. Among these, our MalayalamBERT+Swin Transformer model performed best, achieving the highest weighted F1-score of 0.87631, securing 1st place in the competition. Our results highlight the effectiveness of multimodal learning in detecting misogynistic Malayalam memes and the need for robust AI models in low-resource languages.
pdf
bib
abs
teamiic@DravidianLangTech2025-NAACL 2025: Transformer-Based Multimodal Feature Fusion for Misogynistic Meme Detection in Low-Resource Dravidian Language
Harshita Sharma
|
Simran Simran
|
Vajratiya Vajrobol
|
Nitisha Aggarwal
Misogyny has become a pervasive issue in digital spaces, with misleading gender stereotypes communicated through digital content, most often as text-and-image memes. With the growing prevalence of online content, it is essential to develop automated systems capable of detecting such harmful material to ensure safer online environments. This study focuses on the detection of misogynistic memes in two Dravidian languages, Tamil and Malayalam. The proposed model utilizes a pre-trained XLM-RoBERTa (XLM-R) model for text analysis and a Vision Transformer (ViT) for image feature extraction. A custom neural network classifier was trained on the integrated outputs of both modalities, forming a unified representation that predicts whether a meme is misogynistic. This follows an early-fusion strategy, since the features of both modalities are combined before being fed into the classification model. The approach achieved promising results, with a macro F1-score of 0.84066 on the Malayalam test dataset and 0.68830 on the Tamil test dataset, securing Rank 7 and Rank 11 in the Malayalam and Tamil tracks, respectively, of the shared task on Misogyny Meme Detection (MMD). The findings demonstrate that the multimodal approach significantly enhances the accuracy of detecting misogynistic content compared to text-only or image-only models.
pdf
bib
abs
CUET_Novice@DravidianLangTech 2025: Abusive Comment Detection in Malayalam Text Targeting Women on Social Media Using Transformer-Based Models
Farjana Alam Tofa
|
Khadiza Sultana Sayma
|
Md Osama
|
Ashim Dey
Social media has become a widely used platform for communication and entertainment, but it has also become a space where abuse and harassment can thrive. Women, in particular, face hateful and abusive comments that reflect gender inequality. This paper discusses our participation in the Abusive Text Targeting Women in Dravidian Languages shared task at DravidianLangTech@NAACL 2025, which focuses on detecting abusive text targeting women in Malayalam social media comments. The shared task provided a dataset of YouTube comments in Tamil and Malayalam, focusing on sensitive and controversial topics where abusive behavior is prevalent. Our participation focused on the Malayalam dataset, where the goal was to accurately classify comments as abusive or non-abusive. Malayalam-BERT achieved the best performance on the subtask, securing 3rd place with a macro F1-score of 0.7083, highlighting the effectiveness of transformer models for low-resource languages. These results contribute to tackling gender-based abuse and improving online content moderation for underrepresented languages.
pdf
bib
abs
SemanticCuetSync@DravidianLangTech 2025: Multimodal Fusion for Hate Speech Detection - A Transformer Based Approach with Cross-Modal Attention
Md. Sajjad Hossain
|
Symom Hossain Shohan
|
Ashraful Islam Paran
|
Jawad Hossain
|
Mohammed Moshiul Hoque
The rise of social media has significantly facilitated the rapid spread of hate speech. Detecting hate speech for content moderation is challenging, especially in low-resource languages (LRLs) like Telugu. Although some progress has been made in recent years on unimodal (text or image) hate speech detection in Telugu, there is a lack of research on detection from multimodal content, specifically combining audio and text. In this regard, DravidianLangTech arranged a shared task to address this challenge. This work explored three machine learning (ML), three deep learning (DL), and seven transformer-based models that integrate text and audio modalities using cross-modal attention for hate speech detection. The evaluation results demonstrate that mBERT achieved the highest F1 score of 49.68% using text alone. However, the proposed multimodal attention-based approach with Whisper-small+TeluguBERT-3 achieved an F1 score of 43.68%, which earned us 3rd place in the shared task competition.
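The cross-modal attention mentioned in the abstract above can be sketched in a few lines: text-token states act as queries over audio-frame states, following the standard scaled dot-product formulation. The shapes and random values below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64                             # shared attention dimension (assumed)
rng = np.random.default_rng(1)
text = rng.normal(size=(12, d))    # 12 text-token states act as queries
audio = rng.normal(size=(50, d))   # 50 audio-frame states act as keys/values

# Text attends over audio: scores (12, 50), attended output (12, 64).
scores = text @ audio.T / np.sqrt(d)
weights = softmax(scores, axis=-1)
attended = weights @ audio
print(attended.shape)
```

Each text token thus receives a weighted summary of the audio frames, which a downstream classifier can combine with the original text representation.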
pdf
bib
abs
CUET_Novice@DravidianLangTech 2025: A Bi-GRU Approach for Multiclass Political Sentiment Analysis of Tamil Twitter (X) Comments
Arupa Barua
|
Md Osama
|
Ashim Dey
Political sentiment analysis in multilingual content poses significant challenges in capturing the subtle variations of diverse sentiments expressed in complex and low-resourced languages. Accurately classifying sentiments, whether positive, negative, or neutral, is crucial for understanding public discourse. A shared task on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments, organized by DravidianLangTech@NAACL 2025, provided an opportunity to tackle these challenges. For this task, we implemented two data augmentation techniques, synonym replacement and back translation, and then explored various machine learning (ML) algorithms, including Logistic Regression, Decision Tree, Random Forest, SVM, and Multinomial Naive Bayes. To capture semantic meaning more effectively, we experimented with deep learning (DL) models, including GRU, BiLSTM, BiGRU, and a hybrid CNN-BiLSTM. The Bidirectional Gated Recurrent Unit (BiGRU) achieved the best macro-F1 (MF1) score of 0.33, securing the 17th position in the shared task. These findings underscore the challenges of political sentiment analysis in low-resource languages and the need for advanced language-specific models for improved classification.
pdf
bib
abs
CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
Tewodros Achamaleh
|
Tolulope Olalekan Abiola
|
Lemlem Eyob Kawo
|
Mikiyas Mebraihtu
|
Grigori Sidorov
AI-generated text now matches human writing so well that telling them apart is very difficult. Our CIC-NLP team submitted results to the DravidianLangTech@NAACL 2025 shared task on revealing AI-generated product reviews in Dravidian languages. We performed binary classification with XLM-RoBERTa-Base using the datasets offered by the event organizers. After fine-tuning, our model distinguished human-written from AI-generated reviews with scores of 0.96 for Tamil and 0.88 for Malayalam on the evaluation test set. This paper presents detailed information about preprocessing, model architecture, hyperparameter fine-tuning settings, the experimental process, and the results. The source code is available on GitHub.
pdf
bib
abs
One_by_zero@DravidianLangTech 2025: A Multimodal Approach for Misogyny Meme Detection in Malayalam Leveraging Visual and Textual Features
Dola Chakraborty
|
Shamima Afroz
|
Jawad Hossain
|
Mohammed Moshiul Hoque
Misogyny memes are a form of online content that spreads harmful and damaging ideas about women. By combining images and text, they often aim to mock, disrespect, or insult women, sometimes overtly and other times in more subtle, insidious ways. Detecting Misogyny memes is crucial for fostering safer and more respectful online communities. While extensive research has been conducted on high-resource languages (HRLs) like English, low-resource languages (LRLs) such as Dravidian (e.g., Tamil and Malayalam) remain largely overlooked. The shared task on Misogyny Meme Detection, organized as part of DravidianLangTech@NAACL 2025, provided a platform to tackle the challenge of identifying misogynistic content in memes, specifically in Malayalam. We participated in the competition and adopted a multimodal approach to contribute to this effort. For image analysis, we employed a ResNet18 model to extract visual features, while for text analysis, we utilized the IndicBERT model. Our system achieved an impressive F1-score of 0.87, earning us the 3rd rank in the task.
pdf
bib
abs
CUET-NLP_MP@DravidianLangTech 2025: A Transformer-Based Approach for Bridging Text and Vision in Misogyny Meme Detection in Dravidian Languages
Md. Mohiuddin
|
Md Minhazul Kabir
|
Kawsar Ahmed
|
Mohammed Moshiul Hoque
Misogyny memes, a form of digital content, reflect societal prejudices by discriminating against women through shaming and stereotyping. In this study, we present a multimodal approach combining Indic-BERT and ViT-base-patch16-224 to address misogyny memes. We explored various Machine Learning, Deep Learning, and Transformer models for unimodal and multimodal classification using the provided Tamil and Malayalam meme datasets. Our findings highlight the challenges traditional ML and DL models face in understanding the nuances of Dravidian languages, while emphasizing the importance of transformer models in capturing these complexities. Our multimodal method achieved F1-scores of 77.18% and 84.11% in Tamil and Malayalam, respectively, securing 6th place for both languages among the participants.
pdf
bib
abs
CUET_NetworkSociety@DravidianLangTech 2025: A Transformer-Based Approach for Detecting AI-Generated Product Reviews in Low-Resource Dravidian Languages
Sabik Aftahee
|
Tofayel Ahmmed Babu
|
MD Musa Kalimullah Ratul
|
Jawad Hossain
|
Mohammed Moshiul Hoque
E-commerce platforms face growing challenges to consumer trust and review authenticity because of the growing number of AI-generated product reviews. Low-resource languages (LRLs) such as Tamil and Malayalam have seen little investigation of AI-text detection techniques, constrained by sparse data sources and complex linguistic structures. Our CUET_NetworkSociety team took part in the AI-Generated Review Detection task at DravidianLangTech@NAACL 2025 to fill this gap. Using a combination of machine learning, deep learning, and transformer-based models, we detected AI-generated and human-written reviews in both Tamil and Malayalam. The developed method employed DistilBERT with an advanced preprocessing pipeline and hyperparameter optimization using the Transformers library. This approach achieved a macro F1-score of 0.81 for Tamil (Subtask 1), securing 18th place, and 0.7287 for Malayalam (Subtask 2), ranking 25th.
pdf
bib
abs
CUET_NetworkSociety@DravidianLangTech 2025: A Multimodal Framework to Detect Misogyny Meme in Dravidian Languages
MD Musa Kalimullah Ratul
|
Sabik Aftahee
|
Tofayel Ahmmed Babu
|
Jawad Hossain
|
Mohammed Moshiul Hoque
Memes are commonly used for communication on social media platforms, and some of them can propagate misogynistic content, spreading harmful messages. Detecting such misogynistic memes has become a significant challenge, especially for low-resource languages like Tamil and Malayalam, due to their complex linguistic structures. To tackle this issue, a shared task on detecting misogynistic memes was organized at DravidianLangTech@NAACL 2025. This paper proposes a multimodal deep learning approach for detecting misogynistic memes in Tamil and Malayalam. The proposed model combines fine-tuned ResNet18 for visual feature extraction and indicBERT for analyzing textual content. The fused model was applied to the test dataset, achieving macro F1 scores of 76.32% for Tamil and 80.35% for Malayalam. Our approach led to 7th and 12th positions for Tamil and Malayalam, respectively.
pdf
bib
abs
CUET_NetworkSociety@DravidianLangTech 2025: A Transformer-Driven Approach to Political Sentiment Analysis of Tamil X (Twitter) Comments
Tofayel Ahmmed Babu
|
MD Musa Kalimullah Ratul
|
Sabik Aftahee
|
Jawad Hossain
|
Mohammed Moshiul Hoque
Social media has become an established medium of public communication and opinion on every aspect of life, especially politics. This has resulted in a growing need for tools that can process the large amount of unstructured data produced on these platforms, providing actionable insights in domains such as social trends and political opinion. Low-resource languages like Tamil present challenges due to limited tools and annotated data, highlighting the need for NLP to focus on understudied languages. To address this, a shared task on political sentiment analysis for low-resource languages, with a specific focus on Tamil, was organized by DravidianLangTech@NAACL 2025. In this task, we explored several machine learning methods (SVM, AdaBoost, GB), deep learning methods (CNN, LSTM, GRU, BiLSTM, and ensembles of different deep learning models), and transformer-based methods (mBERT, T5, XLM-R). The mBERT model performed best, achieving a macro F1 score of 0.2178 and placing our team 22nd on the rank list.
pdf
bib
abs
cantnlp@DravidianLangTech-2025: A Bag-of-Sounds Approach to Multimodal Hate Speech Detection
Sidney Wong
|
Andrew Li
This paper presents the systems and results for the Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL) shared task at the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2025). We took a ‘bag-of-sounds’ approach by training our hate speech detection system on the speech (audio) data using transformed Mel spectrogram measures. While our candidate model performed poorly on the test set, our approach offered promising results during training and development for Malayalam and Tamil. With sufficient and well-balanced training data, our results show that it is feasible to use both text and speech (audio) data in the development of multimodal hate speech detection systems.
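The 'bag-of-sounds' idea described above (pooling spectrogram frames so that temporal order is discarded) can be sketched as follows. The paper uses transformed Mel spectrogram measures; this minimal sketch uses a plain FFT magnitude spectrogram and mean pooling, with the mel filterbank omitted for brevity, so the frame and FFT sizes are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(2)
sr = 16000
signal = rng.normal(size=sr)   # 1 s of noise stands in for a speech clip

frame, hop, n_fft = 400, 160, 512   # 25 ms frames, 10 ms hop (assumed)
# Slice the signal into overlapping windowed frames, take magnitude spectra.
starts = range(0, len(signal) - frame, hop)
frames = np.stack([signal[s:s + frame] * np.hanning(frame) for s in starts])
spec = np.abs(np.fft.rfft(frames, n=n_fft))     # (n_frames, 257)

# "Bag of sounds": pool over time so frame order is discarded, leaving one
# fixed-length vector per clip that a standard classifier can consume.
bag = spec.mean(axis=0)
print(spec.shape[1], bag.shape)
```

A real implementation would apply a mel filterbank (e.g. via librosa) before pooling and feed the resulting vectors to the hate speech classifier.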
pdf
bib
abs
LexiLogic@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Billodal Roy
|
Pranav Gupta
|
Souvik Bhattacharyya
|
Niranjan Kumar M
This paper describes our participation in the DravidianLangTech@NAACL 2025 shared task on hate speech detection in Dravidian languages. While the task provided both text transcripts and audio data, we demonstrate that competitive results can be achieved using text features alone. We employed fine-tuned Bidirectional Encoder Representations from Transformers (BERT) models from l3cube-pune for Malayalam, Tamil, and Telugu languages. Our system achieved notable results, securing second position for Tamil and Malayalam tasks, and first position for Telugu in the official leaderboard.
pdf
bib
abs
LexiLogic@DravidianLangTech 2025: Political Multiclass Sentiment Analysis of Tamil X(Twitter) Comments and Sentiment Analysis in Tamil and Tulu
Billodal Roy
|
Souvik Bhattacharyya
|
Pranav Gupta
|
Niranjan Kumar M
We present our approach and findings for two sentiment analysis shared tasks as part of DravidianLangTech@NAACL 2025. The first task involved a seven-class political sentiment classification for Tamil tweets, while the second addressed code-mixed sentiment analysis in Tamil-English and Tulu-English social media texts. We employed language-specific BERT models fine-tuned on the respective tasks, specifically utilizing the L3Cube-Tamil-BERT for Tamil classification and a Telugu-based BERT model for Tulu classification. Our system achieved notable results, particularly securing the first position in the Tulu code-mixed sentiment analysis track. The experiments demonstrate the effectiveness of language-specific pre-trained models for Dravidian language sentiment analysis, while also highlighting the challenges in handling political discourse and code-mixed content.
pdf
bib
abs
Detection of Religious Hate Speech During Elections in Karnataka
Msvpj Sathvik
|
Raj Sonani
|
Ravi Teja Potla
We propose a novel dataset for detecting religious hate speech in the context of elections in Karnataka, with a particular focus on Kannada and Kannada-English code-mixed text. The data was collected during the Karnataka state elections and includes 3,000 labeled samples that reflect various forms of online discourse related to religion. The dataset, which is multilingual and code-mixed, aims to address the growing concern of religious intolerance and hate speech during election periods. To evaluate its usefulness, we benchmarked it using the latest state-of-the-art algorithms, achieving an accuracy of 78.61%.
pdf
bib
abs
DLTCNITPY@DravidianLangTech 2025 Abusive Code-mixed Text Detection System Targeting Women for Tamil and Malayalam Languages using Deep Learning Technique
Habiba A
|
Dr G Aghila
The growing use of social communication platforms has seen women facing higher degrees of online violence than ever before. This paper presents how a deep learning abuse detection system can be applied to inappropriate text directed at women on social media. Because of the diversity of languages, the casual nature of online communication, and cultural diversity around the world, the detection of such content is often severely lacking. This research utilized Long Short-Term Memory (LSTM) networks for abusive text detection in Malayalam and Tamil. The model delivers high F1 scores of 0.75 for Malayalam and 0.72 for Tamil, achieving the desired balance between identifying abusive and non-abusive content. Trained on the dataset provided in the DravidianLangTech@NAACL 2025 shared task, comprising code-mixed abusive and non-abusive social media posts in Malayalam and Tamil, the model shows high detection accuracy and indicates the likely success of deep learning-based models for abusive text detection in resource-constrained languages.
pdf
bib
abs
TSD: Towards Computational Processing of Tamil Similes - A Tamil Simile Dataset
Aathavan Nithiyananthan
|
Jathushan Raveendra
|
Uthayasanker Thayasivam
A simile is a powerful figure of speech that makes a comparison between two different things via shared properties, often using words like “like” or “as” to create vivid imagery, convey emotions, and enhance understanding. However, computational research on similes is limited in low-resource languages like Tamil due to the lack of simile datasets. This work introduces a manually annotated Tamil Simile Dataset (TSD) comprising around 1.5k simile sentences drawn from various sources. Our data annotation guidelines ensure that all the simile sentences are annotated with the three components, namely tenor, vehicle, and context. We benchmark our dataset for simile interpretation and simile generation tasks using chosen pre-trained language models (PLMs) and present the results. Our findings highlight the challenges of simile tasks in Tamil, suggesting areas for further improvement. We believe that TSD will drive progress in computational simile processing for Tamil and other low-resource languages, further advancing simile related tasks in Natural Language Processing.
pdf
bib
abs
Hydrangea@DravidianLanTech2025: Abusive language Identification from Tamil and Malayalam Text using Transformer Models
Shanmitha Thirumoorthy
|
Thenmozhi Durairaj
|
Ratnavel Rajalakshmi
Abusive language toward women on the Internet has always been perceived as a danger to free speech and safe online spaces. In this paper, we discuss three transformer-based models (BERT, XLM-RoBERTa, and DistilBERT) for identifying gender-abusive comments in Tamil and Malayalam YouTube content. We fine-tune and compare these models using the dataset provided by the DravidianLangTech 2025 shared task on identifying abusive content from social media. Among the three, XLM-RoBERTa performed best, reaching F1 scores of 0.7708 for Tamil and 0.6876 for Malayalam. BERT followed with scores of 0.7658 (Tamil) and 0.6671 (Malayalam), while DistilBERT's performance varied across the two languages. The large difference in performance between the models, especially for Malayalam, indicates that working with low-resource languages is difficult and that model choice is critical for abusive language detection. The findings provide useful guidance for building effective content moderation systems in linguistically diverse contexts and, more broadly, for promoting safe online spaces for women in South Indian language communities.
pdf
bib
abs
Towards Effective Emotion Analysis in Low-Resource Tamil Texts
Priyatharshan Balachandran
|
Uthayasanker Thayasivam
|
Randil Pushpananda
|
Ruvan Weerasinghe
Emotion analysis plays a significant role in understanding human behavior and communication, yet research in the Tamil language remains limited. This study focuses on building an emotion classifier for Tamil texts using machine learning (ML) and deep learning (DL), along with creating an emotion-annotated Tamil corpus for Ekman’s basic emotions. Our dataset combines publicly available data with re-annotation and translations. Along with traditional ML models, we investigated the use of Transfer Learning (TL) with state-of-the-art models, such as BERT- and Electra-based models. Experiments were conducted on unbalanced and balanced datasets using data augmentation techniques. The results indicate that Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) performed well with TF-IDF and BoW representations, while among Transfer Learning models, LaBSE achieved the highest accuracy (63% balanced, 69% unbalanced), followed by TamilBERT and IndicBERT.
pdf
bib
CUET_NLP_FiniteInfinity@DravidianLangTech 2025: Exploring Large Language Models for AI-Generated Product Review Classification in Malayalam
Md. Zahid Hasan
|
Safiul Alam Sarker
|
MD Musa Kalimullah Ratul
|
Kawsar Ahmed
|
Mohammed Moshiul Hoque
pdf
bib
abs
NAYEL@DravidianLangTech-2025: Character N-gram and Machine Learning Coordination for Fake News Detection in Dravidian Languages
Hamada Nayel
|
Mohammed Aldawsari
|
Hosahalli Lakshmaiah Shashirekha
This paper introduces a detailed description of the model submitted by team NAYEL to the Fake News Detection in Dravidian Languages shared task. The proposed model uses simple character n-gram TF-IDF features integrated with an ensemble of classical machine learning classification algorithms. Despite its simple structure, the proposed model outperformed several more complex models, as the shared task results show. It achieved an F1-score of 87.5% and secured 5th rank.
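A minimal sketch of the character n-gram TF-IDF plus voting-ensemble setup described above might look as follows. The toy English sentences and the particular ensemble members (logistic regression, naive Bayes, decision tree) are assumptions for illustration; the abstract does not list the exact classifiers used.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; the real task uses Malayalam social media text.
texts = ["shocking secret cure revealed", "minister opens new hospital",
         "aliens endorse candidate", "council approves road budget"] * 5
labels = [1, 0, 1, 0] * 5   # 1 = fake, 0 = original

model = make_pipeline(
    # Character n-grams are robust to the rich morphology and spelling
    # variation found in Dravidian-language social media text.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    VotingClassifier([
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ], voting="hard"),
)
model.fit(texts, labels)
print(model.predict(["shocking secret cure revealed"]))
```

Hard voting takes the majority label across the three fitted classifiers, which is one simple way to combine classical models over shared TF-IDF features.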
pdf
bib
abs
AnalysisArchitects@DravidianLangTech 2025: BERT Based Approach For Detecting AI Generated Product Reviews In Dravidian Languages
Abirami Jayaraman
|
Aruna Devi Shanmugam
|
Dharunika Sasikumar
|
Bharathi B
The shared task on Detecting AI-generated Product Reviews in Dravidian Languages is aimed at addressing the growing concern of AI-generated product reviews, specifically in Malayalam and Tamil. As AI tools become more advanced, the ability to distinguish between human-written and AI-generated content has become increasingly crucial, especially in the domain of online reviews where authenticity is essential for consumer decision-making. In our approach, we used the ALBERT, IndicBERT, and Support Vector Machine (SVM) models to classify the reviews. The results of our experiments demonstrate the effectiveness of our methods in detecting AI-generated content.
pdf
bib
abs
AnalysisArchitects@DravidianLangTech 2025: Machine Learning Approach to Political Multiclass Sentiment Analysis of Tamil
Abirami Jayaraman
|
Aruna Devi Shanmugam
|
Dharunika Sasikumar
|
Bharathi B
Sentiment analysis is recognized as an important area in Natural Language Processing (NLP) that aims at understanding and classifying opinions or emotions in text. In the political field, public sentiment is analyzed to gain insight into opinions, address issues, and shape better policies. Social media platforms like Twitter (now X) are widely used to express thoughts and have become a valuable source of real-time political discussions. In this paper, the shared task of Political Multiclass Sentiment Analysis of Tamil tweets is examined, where the objective is to classify tweets into specific sentiment categories. The proposed approach is explained, which involves preprocessing Tamil text, extracting useful features, and applying machine learning and deep learning models for classification. The effectiveness of the methods is demonstrated through experimental results and the challenges encountered while working on the analysis of Tamil political sentiment are discussed.
pdf
bib
abs
TEAM_STRIKERS@DravidianLangTech2025: Misogyny Meme Detection in Tamil Using Multimodal Deep Learning
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Mohamed Arsath H
|
Ramya K
|
Ragav R
This study focuses on detecting misogynistic content in memes under the title Misogynistic Meme Detection Using Multimodal Deep Learning. Through an analysis of both textual and visual components of memes, specifically in Tamil, the study seeks to detect misogynistic rhetoric directed towards women. The textual analysis involves preprocessing and vectorizing text data using methods like TF-IDF, GloVe, Word2Vec, and transformer-based embeddings such as BERT. For the visual component, deep learning models like ResNet and EfficientNet are used to extract significant image attributes. To improve classification performance, these characteristics are then combined in a multimodal framework employing hybrid architectures such as CNN-LSTM, GRU-EfficientNet, and ResNet-BERT. The classification of memes as misogynistic or non-misogynistic is done using sophisticated machine learning and deep learning approaches. Model performance is evaluated using metrics like Accuracy, Precision, Recall, F1-Score, and Macro Average F1-Score. This study shows how multimodal deep learning can effectively detect and counteract negative narratives about women in digital media by combining natural language processing with image classification.
pdf
bib
abs
KCRL@DravidianLangTech 2025: Multi-Pooling Feature Fusion with XLM-RoBERTa for Malayalam Fake News Detection and Classification
Fariha Haq
|
Md. Tanvir Ahammed Shawon
|
Md Ayon Mia
|
Golam Sarwar Md. Mursalin
|
Muhammad Ibrahim Khan
The rapid spread of misinformation on social media platforms necessitates robust detection mechanisms, particularly for languages with limited computational resources. This paper presents our system for the DravidianLangTech 2025 shared task on Fake News Detection in Malayalam YouTube comments, addressing both binary and multiclass classification challenges. We propose a Multi-Pooling Feature Fusion (MPFF) architecture that leverages [CLS] + Mean + Max pooling strategy with transformer models. Our system demonstrates strong performance across both tasks, achieving a macro-averaged F1 score of 0.874, ranking 6th in binary classification, and 0.628, securing 1st position in multiclass classification. Experimental results show that our MPFF approach with XLM-RoBERTa significantly outperforms traditional machine learning and deep learning baselines, particularly excelling in the more challenging multiclass scenario. These findings highlight the effectiveness of our methodology in capturing nuanced linguistic features for fake news detection in Malayalam, contributing to the advancement of automated verification systems for Dravidian languages.
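The [CLS] + Mean + Max pooling strategy described above can be sketched directly on a transformer's last hidden states; the random matrix and the 768-dimensional hidden size below are illustrative assumptions standing in for real XLM-RoBERTa outputs.

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, hidden = 32, 768
# Stand-in for the encoder's last hidden states for one comment;
# token 0 is the [CLS] position, as in XLM-RoBERTa-style models.
states = rng.normal(size=(seq_len, hidden))

cls_vec = states[0]             # [CLS] pooling: the sentence-level token
mean_vec = states.mean(axis=0)  # mean pooling over all tokens
max_vec = states.max(axis=0)    # max pooling over all tokens

# Fuse the three views into one 3*hidden feature for the classifier head.
fused = np.concatenate([cls_vec, mean_vec, max_vec])
print(fused.shape)
```

The fused 2304-dimensional vector would then feed a classification head; combining the three pooled views lets the classifier see both the dedicated summary token and aggregate token statistics.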
pdf
bib
abs
KCRL@DravidianLangTech 2025: Multi-View Feature Fusion with XLM-R for Tamil Political Sentiment Analysis
Md Ayon Mia
|
Fariha Haq
|
Md. Tanvir Ahammed Shawon
|
Golam Sarwar Md. Mursalin
|
Muhammad Ibrahim Khan
Political discourse on social media platforms significantly influences public opinion, necessitating accurate sentiment analysis for understanding societal perspectives. This paper presents a system developed for the shared task of Political Multiclass Sentiment Analysis in Tamil tweets. The task aims to classify tweets into seven distinct sentiment categories: Substantiated, Sarcastic, Opinionated, Positive, Negative, Neutral, and None of the above. We propose a Multi-View Feature Fusion (MVFF) architecture that leverages XLM-R with a CLS-Attention-Mean mechanism for sentiment classification. Our experimental results demonstrate the effectiveness of our approach, achieving a macro-average F1-score of 0.37 on the test set and securing the 2nd position in the shared task. Through comprehensive error analysis, we identify specific classification challenges and demonstrate how our model effectively navigates the linguistic complexities of Tamil political discourse while maintaining robust classification performance across multiple sentiment categories.
pdf
bib
abs
TensorTalk@DravidianLangTech 2025: Sentiment Analysis in Tamil and Tulu using Logistic Regression and SVM
K Anishka
|
Anne Jacika J
Words are powerful; they shape thoughts that influence actions and reveal emotions. On social media, where billions of people share their opinions daily, comments are the key to understanding how users feel about a video, an image, or even an idea. But what happens when these comments are messy, riddled with code-mixed language, emojis, and informal text? The challenge becomes even greater when analyzing low-resource languages like Tamil and Tulu. To tackle this, TensorTalk deployed machine learning techniques, Logistic Regression for Tamil and SVM for Tulu, to breathe life into unstructured data. By balancing, cleaning, and processing comments, TensorTalk broke through barriers like transliteration and tokenization, unlocking the emotions buried in the language.
pdf
bib
abs
TeamVision@DravidianLangTech 2025: Detecting AI generated product reviews in Dravidian Languages
Shankari S R
|
Sarumathi P
|
Bharathi B
Recent advancements in natural language processing (NLP) have enabled artificial intelligence (AI) models to generate product reviews that are indistinguishable from those written by humans, raising concerns about the authenticity of online reviews. To address these concerns, this study proposes an effective AI detector model capable of differentiating between AI-generated and human-written product reviews. Our methodology incorporates various machine learning techniques, including Naive Bayes, Random Forest, Logistic Regression, and SVM, as well as deep learning approaches based on the BERT architecture. Our findings reveal that BERT outperforms the other models in detecting AI-generated content in both Tamil and Malayalam product reviews.
pdf
bib
abs
CIC-NLP@DravidianLangTech 2025: Fake News Detection in Dravidian Languages
Tewodros Achamaleh
|
Nida Hafeez
|
Mikiyas Mebraihtu
|
Fatima Uroosa
|
Grigori Sidorov
Misinformation is a growing problem for technology companies and for society. Although there exists a large body of related work on identifying fake news in predominantly resource-rich languages, there is unfortunately a lack of such studies in low-resource languages (LRLs). Because corpora and annotated data are scarce in LRLs, the identification of false information remains at an exploratory stage. Fake news detection is critical in this digital era to avoid spreading misleading information. This research work presents an approach to detecting fake news in Dravidian languages. Our team, CIC-NLP, primarily targets Task 1, which involves identifying whether a given social media news item is original or fake. For the fake news detection (FND) problem, we used the mBERT model and the dataset provided by the organizers of the workshop. In this work, we describe our findings and the results of the proposed method. Our mBERT model achieved an F1 score of 0.853.
pdf
bib
abs
CoreFour_IIITK@DravidianLangTech 2025: Abusive Content Detection Against Women Using Machine Learning And Deep Learning Models
Varun Balaji S
|
Bojja Revanth Reddy
|
Vyshnavi Reddy Battula
|
Suraj Nagunuri
|
Balasubramanian Palani
The rise in the use of social media platforms has significantly increased user-generated content, including negative comments about women in Tamil and Malayalam. While these platforms encourage communication and engagement, they also become a medium for the spread of abusive language, which poses challenges to maintaining a safe online environment for women. Preventing the use of abusive content against women as far as possible is the central concern of this research. We detect abusive language against women in Tamil and Malayalam social media comments using computational models such as Logistic Regression, Support Vector Machines (SVM), Random Forest, multilingual BERT, XLM-RoBERTa, and IndicBERT. These models were trained and tested on a specifically curated dataset containing labeled comments in both languages. Among all the approaches, IndicBERT achieved the highest macro F1-score of 0.75. The findings emphasize the significance of combining traditional and advanced computational techniques to address the challenges of Abusive Content Detection (ACD) in regional languages.
pdf
bib
abs
The_Deathly_Hallows@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Vasantharan K
|
Prethish G A
|
Santhosh S
The DravidianLangTech@NAACL 2025 shared task focused on multimodal hate speech detection in Tamil, Telugu, and Malayalam using social media text and audio. Our approach integrated advanced preprocessing, feature extraction, and deep learning models. For text, preprocessing steps included normalization, tokenization, stopword removal, and data augmentation. Feature extraction was performed using TF-IDF, Count Vectorizer, BERT-base-multilingual-cased, XLM-Roberta-Base, and XLM-Roberta-Large, with the latter achieving the best performance. The models attained training accuracies of 83% (Tamil), 88% (Telugu), and 85% (Malayalam). For audio, Mel Frequency Cepstral Coefficients (MFCCs) were extracted and enhanced with augmentation techniques such as noise addition, time-stretching, and pitch-shifting. A CNN-based model achieved training accuracies of 88% (Tamil), 88% (Telugu), and 93% (Malayalam). Macro F1 scores ranked Tamil 3rd (0.6438), Telugu 15th (0.1559), and Malayalam 12th (0.3016). Our study highlights the effectiveness of text-audio fusion in hate speech detection and underscores the importance of preprocessing, multimodal techniques, and feature augmentation in addressing hate speech on social media.
pdf
bib
abs
SSN_IT_NLP@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Maria Nancy C
|
Radha N
|
Swathika R
The proliferation of social media platforms has resulted in increased instances of online abuse, particularly targeting marginalized groups such as women. This study focuses on the classification of abusive comments in Tamil and Malayalam, two Dravidian languages widely spoken in South India. Leveraging a multilingual BERT model, this paper provides an effective approach for detecting and categorizing abusive and non-abusive text. Using labeled datasets comprising social media comments, our model demonstrates its ability to identify targeted abuse with promising accuracy. This paper outlines the dataset preparation, model architecture, training methodology, and the evaluation of results, providing a foundation for combating online abuse in low-resource languages. The methodology is notable for its integration of multilingual BERT and weighted loss functions to address class imbalance, showcasing a pathway for effective abuse detection in other underrepresented languages. The BERT model achieved an F1-score of 0.6519 for Tamil and 0.6601 for Malayalam. The code for this work is available on GitHub: Abusive-Text-targeting-women
pdf
bib
abs
Findings of the Shared Task on Abusive Tamil and Malayalam Text Targeting Women on Social Media: DravidianLangTech@NAACL 2025
Saranya Rajiakodi
|
Bharathi Raja Chakravarthi
|
Shunmuga Priya Muthusamy Chinnan
|
Ruba Priyadharshini
|
Raja Meenakshi J
|
Kathiravan Pannerselvam
|
Rahul Ponnusamy
|
Bhuvaneswari Sivagnanam
|
Paul Buitelaar
|
Bhavanimeena K
|
Jananayagan Jananayagan
|
Kishore Kumar Ponnusamy
This overview paper presents the findings of the Shared Task on Abusive Tamil and Malayalam Text Targeting Women on Social Media, organized as part of DravidianLangTech@NAACL 2025. The task aimed to encourage the development of robust systems to detect abusive content targeting women in Tamil and Malayalam, two low-resource Dravidian languages. Participants were provided with annotated datasets containing abusive and non-abusive text curated from YouTube comments. We present an overview of the approaches and analyse the results of the shared task submissions. We believe the findings presented in this paper will be useful to researchers working in Dravidian language technology.
pdf
bib
abs
LinguAIsts@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Dhanyashree G
|
Kalpana K
|
Lekhashree A
|
Arivuchudar K
|
Arthi R
|
Bommineni Sahitya
|
Pavithra J
|
Sandra Johnson
Social media sites have become crucial venues for communication and interaction, yet they are increasingly used to commit gender-based abuse, with horrific, harassing, and degrading comments targeted at women. This paper addresses the widespread problem of abusive language directed at women in two South Indian languages, Tamil and Malayalam. To find explicit abuse, implicit bias, preconceptions, and coded language, we were given a set of YouTube comments labeled Abusive and Non-Abusive. We applied and compared different machine learning models, namely Support Vector Machines (SVM), Logistic Regression (LR), and Naive Bayes classifiers, to classify comments into the given categories. The models were trained and validated on the given dataset to achieve the best performance with respect to accuracy and macro F1 score. The proposed solutions aim to support robust content moderation systems that can detect and prevent abusive language, ensuring safer online environments for women.
pdf
bib
abs
Celestia@DravidianLangTech 2025: Malayalam-BERT and m-BERT based transformer models for Fake News Detection in Dravidian Languages
Syeda Alisha Noor
|
Sadia Anjum
|
Syed Ahmad Reza
|
Md Rashadur Rahman
Fake news detection in Malayalam is difficult due to limited data and linguistic challenges. This study compares machine learning, deep learning, and transformer models for classification. The dataset is balanced and divided into training, development, and test sets. The machine learning models (SVM, Random Forest, Naive Bayes) used TF-IDF features, while the deep learning models (LSTM, BiLSTM, CNN) worked with tokenized sequences. We fine-tuned transformer models such as IndicBERT, MuRIL, mBERT, and Malayalam-BERT. Among them, Malayalam-BERT performed best overall, achieving an F1 score of 86%, while mBERT was the most effective at spotting fake news specifically. However, the models struggled with mixed-language text and complex writing. Despite these challenges, transformer models proved the most effective for detecting fake news in Malayalam.
pdf
bib
abs
DravLingua@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages using Late Fusion of Muril and Wav2Vec Models
Aishwarya Selvamurugan
Detecting hate speech on social media is increasingly difficult, particularly in low-resource Dravidian languages such as Tamil, Telugu and Malayalam. Traditional approaches primarily rely on text-based classification, often overlooking the multimodal nature of online communication, where speech plays a pivotal role in spreading hate speech. We propose a multimodal hate speech detection model using a late fusion technique that integrates Wav2Vec 2.0 for speech processing and Muril for text analysis. Our model is evaluated on the DravidianLangTech@NAACL 2025 dataset, which contains speech and text data in Telugu, Tamil, and Malayalam scripts. The dataset is categorized into six classes: Non-Hate, Gender Hate, Political Hate, Religious Hate, Religious Defamation, and Personal Defamation. To address class imbalance, we incorporate class weighting and data augmentation techniques. Experimental results demonstrate that the late fusion approach effectively captures patterns of hate speech that may be missed when analyzing a single modality. This highlights the importance of multimodal strategies in enhancing hate speech detection, particularly for low-resource languages.
pdf
bib
abs
Trio Innovators @ DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Radha N
|
Swathika R
|
Farha Afreen I
|
Annu G
|
Apoorva A
This paper presents an in-depth study on multimodal hate speech detection in Dravidian languages—Tamil, Telugu, and Malayalam—by leveraging both audio and text modalities. Detecting hate speech in these languages is particularly challenging due to factors such as code-mixing, limited linguistic resources, and diverse cultural contexts. Our approach integrates advanced techniques for audio feature extraction and XLM-RoBERTa for text representation, with feature alignment and fusion to develop a robust multimodal framework. The dataset is carefully categorized into labeled classes: gender-based, political, religious, and personal defamation hate speech, along with a non-hate category. Experimental results indicate that our model achieves a macro F1-score of 0.76 and an accuracy of approximately 85%.
pdf
bib
abs
Wictory@DravidianLangTech 2025: Political Sentiment Analysis of Tamil X(Twitter) Comments using LaBSE and SVM
Nithish Ariyha K
|
Eshwanth Karti T R
|
Yeshwanth Balaji A P
|
Vikash J
|
Sachin Kumar S
Political sentiment analysis has become an essential area of research in Natural Language Processing (NLP), driven by the rapid rise of social media as a key platform for political discourse. This study focuses on sentiment classification in Tamil political tweets, addressing the linguistic and cultural complexities inherent in low-resource languages. To overcome data scarcity challenges, we develop a system that integrates embeddings with advanced machine learning techniques, ensuring effective sentiment categorization. Our approach leverages deep learning-based models and transformer architectures to capture nuanced expressions, contributing to improved sentiment classification. This work enhances NLP methodologies for low-resource languages and provides valuable insights into Tamil political discussions, aiding policymakers and researchers in understanding public sentiment more accurately. Notably, our system secured Rank 5 in the NAACL shared task, demonstrating its effectiveness in real-world sentiment classification challenges.
pdf
bib
abs
ANSR@DravidianLangTech 2025: Detection of Abusive Tamil and Malayalam Text Targeting Women on Social Media using RoBERTa and XGBoost
Nishanth S
|
Shruthi Rengarajan
|
S Ananthasivan
|
Burugu Rahul
|
Sachin Kumar S
Abusive language directed at women on social media, often characterized by crude slang, offensive terms, and profanity, is not just harmful communication but also acts as a tool for serious and widespread cyber violence. It is imperative that this pressing issue be addressed in order to establish safer online spaces and provide efficient methods for detecting and minimising this kind of abuse. However, the intentional masking of abusive language, especially in regional languages like Tamil and Malayalam, presents significant obstacles, making detection and prevention more difficult. The system we developed effectively identifies abusive sentences using supervised machine learning techniques based on RoBERTa embeddings. The method aims to improve upon current abusive language detection systems, which are essential for various online platforms, including social media and online gaming services. The proposed method ranked 8th in Malayalam and 20th in Tamil in terms of F1 score.
pdf
bib
abs
Synapse@DravidianLangTech 2025: Multiclass Political Sentiment Analysis in Tamil X (Twitter) Comments: Leveraging Feature Fusion of IndicBERTv2 and Lexical Representations
Suriya Kp
|
Durai Singh K
|
Vishal A S
|
Kishor S
|
Sachin Kumar S
Social media platforms like X (Twitter) have gained popularity for political debates and election campaigns in the last decade. This creates the need to moderate and understand the sentiment of tweets in order to gauge the state of digital campaigns. This paper focuses on political sentiment classification of Tamil X (Twitter) comments, which proves challenging because of informal expressions, code-switching, and limited annotated datasets. The study categorizes comments into seven classes: substantiated, sarcastic, opinionated, positive, negative, neutral, and none of the above. We propose a solution to the Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments shared task at DravidianLangTech@NAACL 2025 that incorporates the IndicBERTv2-MLM-Back-Translation model and TF-IDF vectors into a custom model. Further, we explore preprocessing techniques that enrich hashtags and emojis with their context. Our approach achieved Rank 1 with a macro-average F1 score of 0.38 in the shared task.
pdf
bib
abs
Findings of the Shared Task on Misogyny Meme Detection: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi
|
Rahul Ponnusamy
|
Saranya Rajiakodi
|
Shunmuga Priya Muthusamy Chinnan
|
Paul Buitelaar
|
Bhuvaneswari Sivagnanam
|
Anshid K A
The rapid expansion of social media has facilitated communication but also enabled the spread of misogynistic memes, reinforcing gender stereotypes and toxic online environments. Detecting such content is challenging due to the multimodal nature of memes, where meaning emerges from the interplay of text and images. The Misogyny Meme Detection shared task at DravidianLangTech@NAACL 2025 focused on Tamil and Malayalam, encouraging the development of multimodal approaches. With 114 teams registered and 23 submitting predictions, participants leveraged various pretrained language models and vision models through fusion techniques. The best models achieved high macro F1 scores (0.83682 for Tamil, 0.87631 for Malayalam), highlighting the effectiveness of multimodal learning. Despite these advances, challenges such as bias in the data set, class imbalance, and cultural variations persist. Future research should refine multimodal detection methods to improve accuracy and adaptability, fostering safer and more inclusive online spaces.
pdf
bib
abs
Overview of the Shared Task on Sentiment Analysis in Tamil and Tulu
Thenmozhi Durairaj
|
Bharathi Raja Chakravarthi
|
Asha Hegde
|
Hosahalli Lakshmaiah Shashirekha
|
Rajeswari Natarajan
|
Sajeetha Thavareesan
|
Ratnasingam Sakuntharaj
|
Krishnakumari K
|
Charmathi Rajkumar
|
Poorvi Shetty
|
Harshitha S Kumar
Sentiment analysis is an essential task for interpreting subjective opinions and emotions in textual data, with significant implications across commercial and societal applications. This paper provides an overview of the shared task on Sentiment Analysis in Tamil and Tulu, organized as part of DravidianLangTech@NAACL 2025. The task comprises two components: one addressing Tamil and the other focusing on Tulu, both designed as multi-class classification challenges, wherein the sentiment of a given text must be categorized as positive, negative, neutral, or unknown. The dataset was diligently compiled by aggregating user-generated content from social media platforms such as YouTube and Twitter, ensuring linguistic diversity and real-world applicability. Participants applied a variety of computational approaches, ranging from traditional machine learning models to deep learning models, pre-trained language models, and other feature representation techniques, to tackle the challenges posed by linguistic code-mixing, orthographic variations, and resource scarcity in these low-resource languages.
pdf
bib
abs
cuetRaptors@DravidianLangTech 2025: Transformer-Based Approaches for Detecting Abusive Tamil Text Targeting Women on Social Media
Md. Mubasshir Naib
|
Md. Saikat Hossain Shohag
|
Alamgir Hossain
|
Jawad Hossain
|
Mohammed Moshiul Hoque
With the exponential growth of social media usage, the prevalence of abusive language targeting women has become a pressing issue, particularly in low-resource languages (LRLs) like Tamil and Malayalam. This study is part of the shared task at DravidianLangTech@NAACL 2025, which focuses on detecting abusive comments in Tamil social media content. The provided dataset consists of binary-labeled comments (Abusive or Non-Abusive) gathered from YouTube, reflecting explicit abuse, implicit bias, stereotypes, and coded language. We developed and evaluated multiple models for this task, including traditional machine learning algorithms (Logistic Regression, Support Vector Machine, Random Forest Classifier, and Multinomial Naive Bayes), deep learning models (CNN, BiLSTM, and CNN+BiLSTM), and transformer-based architectures (DistilBERT, Multilingual BERT, XLM-RoBERTa), along with fine-tuned variants of these models. Our best-performing model, Multilingual BERT, achieved a weighted F1-score of 0.7203, ranking 19th in the competition.
pdf
bib
abs
Overview on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi
|
Saranya Rajiakodi
|
Thenmozhi Durairaj
|
Sathiyaraj Thangasamy
|
Ratnasingam Sakuntharaj
|
Prasanna Kumar Kumaresan
|
Kishore Kumar Ponnusamy
|
Arunaggiri Pandian Karunanidhi
|
Rohan R
Political multiclass sentiment analysis is the task of assigning comments to one of seven predefined political sentiment classes. In this paper, we report an overview of the findings of the “Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments” shared task conducted at the DravidianLangTech@NAACL 2025 workshop. Participants were provided with annotated Twitter comments, split into training, development, and unlabelled test datasets. A total of 139 participants registered for this shared task, and 25 teams finally submitted their results. The performance of the submitted systems was evaluated and ranked in terms of the macro F1 score.
pdf
bib
abs
KEC_AI_BRIGHTRED@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel
|
Malliga Subramanian
|
Nishdharani P
|
Santhiya E
|
Yaswanth Raj E
Hate speech detection in multilingual settings presents significant challenges due to linguistic variations and speech patterns across different languages. This study proposes a fusion-based approach that integrates audio and text features to enhance classification accuracy in Tamil, Telugu, and Malayalam. We extract Mel-Frequency Cepstral Coefficients and their delta variations for speech representation, while text-based features contribute additional linguistic insights. Several models were evaluated, including BiLSTM, Capsule Networks with Attention, Capsule-GRU, ConvLSTM-BiLSTM, and Multinomial Naïve Bayes, to determine the most effective architecture. Experimental results demonstrate that Random Forest performs best for text classification, while CNN achieves the highest accuracy for audio classification. The model was evaluated using the Macro F1 score and ranked ninth in Tamil with a score of 0.3018, ninth in Telugu with a score of 0.251, and thirteenth in Malayalam with a score of 0.2782 in the Multimodal Social Media Data Analysis in Dravidian Languages shared task at DravidianLangTech@NAACL 2025. By leveraging feature fusion and optimized model selection, this approach provides a scalable and effective framework for multilingual hate speech detection, contributing to improved content moderation on social media platforms.
pdf
bib
abs
Overview of the Shared Task on Fake News Detection in Dravidian Languages-DravidianLangTech@NAACL 2025
Malliga Subramanian
|
Premjith B
|
Kogilavani Shanmugavadivel
|
Santhiya Pandiyan
|
Balasubramanian Palani
|
Bharathi Raja Chakravarthi
Detecting and mitigating fake news on social media is critical for preventing misinformation, protecting democratic processes, preventing public distress, mitigating hate speech, reducing financial fraud, maintaining information reliability, etc. This paper summarizes the findings of the shared task “Fake News Detection in Dravidian Languages—DravidianLangTech@NAACL 2025.” The goal of this task is to detect fake content in social media posts in Malayalam. It consists of two subtasks: the first focuses on binary classification (Fake or Original), while the second categorizes the fake news into five types—False, Half True, Mostly False, Partly False, and Mostly True. In Task 1, 22 teams submitted systems based on machine learning techniques such as SVM, Naïve Bayes, and SGD, as well as BERT-based architectures. Among these, XLM-RoBERTa had the highest macro F1 score of 89.8%. For Task 2, 11 teams submitted models using LSTM, GRU, XLM-RoBERTa, and SVM. XLM-RoBERTa once again outperformed the other models, attaining the highest macro F1 score of 68.2%.