Jubeerathan Thevakumar

2025

pdf bib abs
EmoTa: A Tamil Emotional Speech Dataset
Jubeerathan Thevakumar | Luxshan Thavarasa | Thanikan Sivatheepan | Sajeev Kugarajah | Uthayasanker Thayasivam
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

This paper introduces EmoTa, the first emotional speech dataset in Tamil, designed to reflect the linguistic diversity of Sri Lankan Tamil speakers. EmoTa comprises 936 recorded utterances from 22 native Tamil speakers (11 male, 11 female), each articulating 19 semantically neutral sentences across five primary emotions: anger, happiness, sadness, fear, and neutrality. To ensure quality, inter-annotator agreement was assessed using Fleiss’ Kappa, resulting in a substantial agreement score of 0.74. Initial evaluations using machine learning models, including XGBoost and Random Forest, yielded a high F1-score of 0.91 and 0.90 for emotion classification tasks. By releasing EmoTa, we aim to encourage further exploration of Tamil language processing and the development of innovative models for Tamil Speech Emotion Recognition.

pdf bib abs
Incepto@DravidianLangTech 2025: Detecting Abusive Tamil and Malayalam Text Targeting Women on YouTube
Luxshan Thavarasa | Sivasuthan Sukumar | Jubeerathan Thevakumar
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This study introduces a novel multilingualmodel designed to effectively address the challenges of detecting abusive content in low resource, code-mixed languages, where limiteddata availability and the interplay of mixed languages, leading to complex linguistic phenomena, create significant hurdles in developingrobust machine learning models. By leveraging transfer learning techniques and employingmulti-head attention mechanisms, our modeldemonstrates impressive performance in detecting abusive content in both Tamil and Malayalam datasets. On the Tamil dataset, our teamachieved a macro F1 score of 0.7864, whilefor the Malayalam dataset, a macro F1 score of0.7058 was attained. These results highlight theeffectiveness of our multilingual approach, delivering strong performance in Tamil and competitive results in Malayalam.

pdf bib abs
RATHAN@DravidianLangTech 2025: Annaparavai - Separate the Authentic Human Reviews from AI-generated one
Jubeerathan Thevakumar | Luheerathan Thevakumar
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Detecting AI-generated reviews is crucial for maintaining the authenticity of online feedback in low-resource languages like Tamil and Malayalam. We propose a transfer learning-based approach using embeddings from XLM-RoBERTa, IndicBERT, mT5, and Sentence-BERT, validated with five-fold cross-validation via XGBoost. These embeddings are used to train deep neural networks (DNNs), refined through a weighted ensemble model. Our method achieves 90% F1-score for Malayalam and 73% for Tamil, demonstrating the effectiveness of transfer learning and ensembling for review detection. The source code is publicly available to support further research and improve online review systems in multilingual settings.

Co-authors

Luheerathan Thevakumar 1

Venues

Fix data