Kalaivani K S


2026

The widespread use of Tamil on social media has heightened the need to address online harassment of women. Promoting safe online communication therefore requires systems that automatically identify abusive content in Tamil. This paper presents a binary classification model that labels content as Abusive or Non-Abusive. We experimented with several multilingual transformer models, including DistilBERT, mBERT, and XLM-RoBERTa. XLM-RoBERTa performed better than the others, achieving an accuracy of 91.17% and a macro F1 score of 0.8865. Ablation experiments show that structured preprocessing, balancing the minority class, and tuning the hyperparameters contribute to the model’s performance.
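
As a rough illustration of why a macro F1 score is reported alongside accuracy for an imbalanced Abusive/Non-Abusive task, the sketch below computes both metrics on a small set of hypothetical labels. The function names and toy data are assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 Non-Abusive and 2 Abusive examples,
# with one Abusive example misclassified as Non-Abusive.
y_true = ["Non-Abusive"] * 8 + ["Abusive"] * 2
y_pred = ["Non-Abusive"] * 8 + ["Non-Abusive", "Abusive"]

toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
```

On this toy data the accuracy is 0.9 while the macro F1 is roughly 0.80, because the single missed minority-class example weighs heavily on the Abusive class's F1.
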
The rapid expansion of digital connectivity across India has dramatically increased participation in speech-enabled services and multilingual communication platforms. Tamil, with its rich dialectal diversity across geographical regions, presents unique challenges for automatic speech recognition (ASR) and dialect identification systems. We participated in the DravidianLangTech 2026 shared task to classify Tamil speech into four regional dialects (Central, Northern, Southern, Western) and to perform automatic speech recognition. We trained four models for dialect classification (SVM, Random Forest, CNN, CNN+BiLSTM) and two transfer learning models for ASR (Wav2Vec2-Base, Wav2Vec2-XLSR-53). Among the classification models, SVM with MFCC features achieved the best performance, with a macro F1-score of 94.17% and a validation accuracy of 94.35%. For ASR, Wav2Vec2-XLSR-53 obtained a word error rate (WER) of 15.3%, demonstrating effective cross-lingual knowledge transfer. Our analysis reveals that traditional machine learning approaches with engineered features outperform deep learning methods in low-resource scenarios with limited training data. Code is available at: https://github.com/Naveen-Arul/dravid-tech
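
For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and example strings are hypothetical and are not taken from the linked repository.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

Here `example_wer` is 0.5, since two edits are needed against four reference words. A corpus-level WER would usually sum the edits and reference lengths over all utterances rather than average per-utterance rates.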

2025

The introduction of Jio in India has significantly increased the number of social media users, particularly on platforms like X (Twitter), Facebook, and Instagram. While this growth is positive, it has also increased the number of users posting in their native languages, making social media analysis more complex. In this study, we focus on Tamil, a Dravidian language, and aim to classify social media comments from X (Twitter) into seven different categories. Tamil-speaking users often communicate using a mix of Tamil and English, creating unique challenges for analysis and tracking. This surge in diverse language usage on social media highlights the need for robust sentiment analysis tools to ensure the platform remains accessible and user-friendly for everyone, regardless of political opinion. We trained four machine learning models (SGD Classifier, Random Forest Classifier, Decision Tree, and Multinomial Naive Bayes) to identify and classify the comments. Among these, the SGD Classifier achieved the best performance, with a training accuracy of 83.67% and a validation accuracy of 80.43%.