Shanthi Murugan

2026

Trailblazer@DravidianLangTech 2026: A Comparative Study of TF-IDF SVM and XLM-RoBERTa for Political Multiclass Text Classification.
Anuradha C | Anbuaruvi R | Shanthi Murugan
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The rapid growth of social media networks poses challenges for classifying multilingual and code-mixed data. The shared task on Political Multiclass Sentiment Analysis of Tamil X (Twitter) at DravidianLangTech@ACL 2026 asks participants to classify political text. For this task, we compare a traditional machine learning model with transformer-based models. First, we developed a baseline Support Vector Machine (SVM) model using TF-IDF features. To provide a stronger Indic-language baseline, we consider IndicBERT, a transformer model specifically designed for Indian languages, which improves contextual understanding of Tamil-English code-mixed political text. To capture deeper information from the text, we developed an XLM-RoBERTa model with minimal pre-processing. The results show that the transformer-based model outperforms the traditional baseline, with a macro F1 score of 0.3738. The study highlights the importance of robust multi-class classification of political text on social media.
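
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how a macro F1 score is typically computed: F1 is calculated per class and then averaged without weighting by class frequency. The function name `macro_f1` and the labels "A", "B", "C" are hypothetical illustrations, not the task's label set or the authors' evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
gold = ["A", "A", "B", "C"]
predicted = ["A", "B", "B", "C"]
print(round(macro_f1(gold, predicted), 4))
```

Because every class contributes equally to the average, a model that does poorly on rare classes is penalized more heavily than it would be under plain accuracy.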

2024

Challenges and Insights in Identifying Hate Speech and Fake News on Social Media
Shanthi Murugan | Arthi R | Boomika E | Jeyanth S | Kaviyarasu S
Proceedings of the 21st International Conference on Natural Language Processing (ICON): Shared Task on Decoding Fake Narratives in Spreading Hateful Stories (Faux-Hate)

Social media has transformed communication, but it has also brought about a number of serious problems, most notably the proliferation of hate speech and false information. Hate-related conversations are frequently fueled by misleading narratives. We address this issue by building a multiclass classification model trained on the Faux Hate Multi-Label Dataset (Biradar et al. 2024), which consists of fake hateful remarks in code-mixed Hindi and English. The model is built to classify Severity (Low, Medium, High) and Target (Individual, Organization, Religion) on the dataset. The model is evaluated on the test dataset, with a score reported for each task: 74% for Severity and 74% for Target. The limitations and performance issues of the model are analyzed and explained.

Integration of Self-Attention Model with Intralingual Word Embedding for Contextual Semantic Analysis of Thirukkural Text
Shanthi Murugan | Kaviyarasu S | Balasundaram S R
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Thirukkural, one of the ancient works of Tamil literature, is popular worldwide for the moral values and practices it teaches society. Understanding the meaning of the verses, especially in context, is important. This paper introduces a system designed to generate contextualized word meanings for the couplets of the Thirukkural, tailored to help school children understand the text more effectively. Unlike traditional methods that provide detailed explanations in paragraph form, our method focuses on word-by-word interpretation based on context through an integrated self-attention model. By combining the self-attention mechanism with FastText embeddings, our approach achieves improved performance over state-of-the-art models such as Word2Vec and standalone FastText. We evaluate the semantic understanding of the Thirukkural text using manual scoring as the metric. Tamil Thirukkural Agarathi serves as the gold-standard dataset for evaluation, demonstrating the effectiveness of our approach in capturing the nuanced semantics of the Thirukkural.

Co-authors

Jeyanth S 1

Balasundaram S R 1
