Uthayasanker Thayasivam
The dynamic field of speaker diarization continues to present significant challenges despite notable advancements in recent years, and the rising focus on complex acoustic scenarios underscores the importance of sustained research in this area. While speech resources for speaker diarization are expanding rapidly, aided by semi-automated techniques, many existing datasets remain outdated and lack authentic real-world conversational data. This challenge is particularly acute for low-resource South Asian languages, owing to limited public media data and reduced research effort. Sinhala and Tamil are two such languages with limited speaker diarization datasets. To address this gap, we introduce a new speaker diarization dataset for these languages and evaluate multiple existing models to assess their performance. This work provides essential resources, namely a novel dataset and insights from model benchmarks, to advance speaker diarization for low-resource languages, particularly Sinhala and Tamil.
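As an illustration of how such benchmarks are typically scored, the sketch below computes the Diarization Error Rate (DER), the standard metric for this task, using pyannote.metrics; the segment boundaries and speaker labels are toy values, not drawn from the dataset described above.

    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    # Toy reference and hypothesis annotations (hypothetical values).
    reference = Annotation()
    reference[Segment(0.0, 5.0)] = "speaker_A"
    reference[Segment(5.0, 9.0)] = "speaker_B"

    hypothesis = Annotation()
    hypothesis[Segment(0.0, 4.5)] = "spk_1"
    hypothesis[Segment(4.5, 9.0)] = "spk_2"

    # DER = (missed speech + false alarm + speaker confusion) / total speech.
    metric = DiarizationErrorRate()
    print(f"DER: {metric(reference, hypothesis):.3f}")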
Hate speech on social media platforms is a critical issue, especially in low-resource languages such as Sinhala and Tamil, where the lack of annotated datasets and linguistic tools hampers the development of effective detection systems. This research introduces a novel framework for detecting hate speech in low-resource languages by leveraging Multilingual Large Language Models (MLLMs) integrated with a Dual Contrastive Learning (DCL) strategy. Our approach enhances detection by capturing the nuances of hate speech in low-resource settings, applying both self-supervised and supervised contrastive learning techniques. We evaluate our framework on datasets from Facebook and Twitter, demonstrating its superior performance compared to traditional deep learning models such as CNN, LSTM, and BiGRU. The results highlight the efficacy of DCL models, particularly when fine-tuned on domain-specific data, with the best performance achieved by the Twitter/twhin-bert-base model. This study underscores the potential of advanced machine learning techniques for improving hate speech detection in under-resourced languages, paving the way for further research in this domain.
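The supervised half of a dual contrastive objective can be sketched as follows. This is a minimal, generic SupCon-style loss in PyTorch, assuming the embeddings come from an MLLM encoder; it is a sketch of the general technique, not the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
        """Generic supervised contrastive loss: pull same-label embeddings together."""
        z = F.normalize(embeddings, dim=1)                 # cosine geometry
        sim = z @ z.T / temperature
        n = z.size(0)
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-pairs
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        pos_counts = pos_mask.sum(1)
        valid = pos_counts > 0                             # anchors with >= 1 positive
        loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1)
        return (loss[valid] / pos_counts[valid]).mean()

    # Toy usage: 8 embeddings, binary labels.
    z = torch.randn(8, 128)
    y = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
    print(supervised_contrastive_loss(z, y))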
This paper introduces EmoTa, the first emotional speech dataset in Tamil, designed to reflect the linguistic diversity of Sri Lankan Tamil speakers. EmoTa comprises 936 recorded utterances from 22 native Tamil speakers (11 male, 11 female), each articulating 19 semantically neutral sentences across five primary emotions: anger, happiness, sadness, fear, and neutrality. To ensure quality, inter-annotator agreement was assessed using Fleiss’ Kappa, resulting in a substantial agreement score of 0.74. Initial evaluations using machine learning models, including XGBoost and Random Forest, yielded high F1-scores of 0.91 and 0.90, respectively, on emotion classification. By releasing EmoTa, we aim to encourage further exploration of Tamil language processing and the development of innovative models for Tamil Speech Emotion Recognition.
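Agreement scores of this kind are commonly computed as in the sketch below, using statsmodels' fleiss_kappa on a matrix of per-item annotator labels; the three annotators and the labels here are invented for illustration.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Rows are utterances, columns are annotators (toy labels, 3 annotators assumed).
    ratings = np.array([
        ["anger",     "anger",     "anger"],
        ["happiness", "happiness", "sadness"],
        ["neutral",   "neutral",   "neutral"],
        ["fear",      "sadness",   "fear"],
    ])

    # aggregate_raters converts labels into per-item category counts.
    table, categories = aggregate_raters(ratings)
    print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")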
A simile is a powerful figure of speech that makes a comparison between two different things via shared properties, often using words like “like” or “as” to create vivid imagery, convey emotions, and enhance understanding. However, computational research on similes is limited in low-resource languages like Tamil due to the lack of simile datasets. This work introduces a manually annotated Tamil Simile Dataset (TSD) comprising around 1.5k simile sentences drawn from various sources. Our data annotation guidelines ensure that all simile sentences are annotated with three components, namely tenor, vehicle, and context. We benchmark our dataset on simile interpretation and simile generation tasks using selected pre-trained language models (PLMs) and present the results. Our findings highlight the challenges of simile tasks in Tamil, suggesting areas for further improvement. We believe that TSD will drive progress in computational simile processing for Tamil and other low-resource languages, further advancing simile-related tasks in Natural Language Processing.
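One common way to cast simile generation as a text-to-text problem with a PLM is sketched below using mT5 from Hugging Face; the prompt format and the model choice are assumptions made for illustration, not the paper's setup, and the checkpoint would need fine-tuning on TSD before producing useful output.

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # mT5 is one multilingual PLM that covers Tamil (model choice is an assumption).
    tok = AutoTokenizer.from_pretrained("google/mt5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

    # Hypothetical prompt format: predict the vehicle given tenor and context.
    prompt = "generate simile: tenor=<tenor phrase> context=<sentence context>"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))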
Emotion analysis plays a significant role in understanding human behavior and communication, yet research on the Tamil language remains limited. This study focuses on building an emotion classifier for Tamil texts using machine learning (ML) and deep learning (DL), along with creating an emotion-annotated Tamil corpus for Ekman’s basic emotions. Our dataset combines publicly available data with re-annotation and translations. Along with traditional ML models, we investigated the use of Transfer Learning (TL) with state-of-the-art models such as BERT- and Electra-based models. Experiments were conducted on unbalanced and balanced datasets using data augmentation techniques. The results indicate that Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) performed well with TF-IDF and BoW representations, while among Transfer Learning models, LaBSE achieved the highest accuracy (63% balanced, 69% unbalanced), followed by TamilBERT and IndicBERT.
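A minimal version of the strongest classical baseline, Multinomial Naive Bayes over TF-IDF features, looks like the scikit-learn pipeline below; the example texts and labels are placeholders, not items from the corpus.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Placeholder corpus; the real inputs would be Tamil texts with Ekman labels.
    texts = ["text expressing joy", "text expressing anger",
             "another joyful text", "another angry text"]
    labels = ["joy", "anger", "joy", "anger"]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer()),   # BoW counts reweighted by TF-IDF
        ("mnb", MultinomialNB()),
    ])
    clf.fit(texts, labels)
    print(clf.predict(["a new joyful text"]))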
Multilingual speaker identification and verification is a challenging task, especially for languages with diverse acoustic and linguistic features such as Indo-Aryan and Dravidian languages. Previous models have struggled to generalize across multilingual environments, leading to significant performance degradation when applied to multiple languages. In this paper, we propose an advanced approach to multilingual speaker identification and verification, specifically designed for Indo-Aryan and Dravidian languages. Empirical results on the Kathbath dataset show that our approach significantly improves speaker identification accuracy, reducing the performance gap between monolingual and multilingual systems from 15% to just 1%. Additionally, our model reduces the equal error rate for speaker verification from 15% to 5% in noisy conditions. Our method demonstrates strong generalization capabilities across diverse languages, offering a scalable solution for multilingual voice-based biometric systems.
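The equal error rate quoted above is conventionally computed from verification trial scores, as in the scikit-learn-based sketch below; the scores and labels are synthetic.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(labels, scores):
        """EER: the operating point where false accepts equal false rejects."""
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2

    # Synthetic trials: 1 = same speaker, 0 = different speaker.
    labels = np.array([1, 1, 1, 0, 0, 0])
    scores = np.array([0.9, 0.8, 0.4, 0.5, 0.3, 0.1])
    print(f"EER: {equal_error_rate(labels, scores):.2f}")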
Extracting legal entities from legal documents, particularly legal parties in contract documents, poses a significant challenge for legal assistive software. Many existing party extraction systems tend to generate numerous false positives due to the complex structure of legal text. In this study, we present a novel and accurate method for extracting parties from legal contract documents by leveraging contextual span representations. To facilitate our approach, we have curated a large-scale dataset comprising 1000 contract documents with party annotations. Our method incorporates several enhancements to the SQuAD 2.0 question-answering system, specifically tailored to handle the intricate nature of legal text. These enhancements include modifications to the activation function, an increased number of encoder layers, and the addition of normalization and dropout layers stacked on top of the output encoder layer. Baseline experiments reveal that our model, fine-tuned on our dataset, outperforms the current state-of-the-art model. Furthermore, we explore various combinations of the aforementioned techniques to further enhance accuracy. By employing a hybrid approach that combines 24 encoder layers with normalization and dropout layers, we achieve the best results, with an exact match score of 0.942 (a +6.2% improvement).
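A minimal PyTorch sketch of the kind of head described, normalization and dropout layers stacked on the output encoder layer before SQuAD-style start/end span prediction, is shown below; the hidden size and dropout rate are assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SpanExtractionHead(nn.Module):
        """Hypothetical head: LayerNorm + dropout over the encoder output,
        then start/end logits as in SQuAD-style extractive QA."""
        def __init__(self, hidden_size=768, dropout=0.1):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_size)
            self.drop = nn.Dropout(dropout)
            self.qa_outputs = nn.Linear(hidden_size, 2)

        def forward(self, encoder_output):           # (batch, seq, hidden)
            x = self.drop(self.norm(encoder_output))
            start_logits, end_logits = self.qa_outputs(x).split(1, dim=-1)
            return start_logits.squeeze(-1), end_logits.squeeze(-1)

    # Usage with a fake encoder output.
    starts, ends = SpanExtractionHead()(torch.randn(2, 128, 768))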
Speech carries not only semantic content but also paralinguistic information that captures speaking style. Speaker traits and emotional states affect how words are spoken. Research on paralinguistic information is an emerging field in speech and language processing, with many potential applications including speech recognition, speaker identification and verification, emotion recognition, and accent recognition. Among these, there is significant interest in emotion recognition from speech. This paper presents a detailed study of the paralinguistic information present in the speech signal and an overview of research on speech emotion for the Tamil language.
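Typical paralinguistic cues discussed in such studies, for example pitch and energy contours, can be extracted as sketched below with librosa; the audio path is a placeholder, and this is a generic illustration rather than the paper's method.

    import librosa

    # Placeholder path; any mono speech recording works.
    y, sr = librosa.load("speech_sample.wav", sr=16000)

    # Fundamental frequency (pitch) contour via probabilistic YIN.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

    # Short-time energy, a common arousal-related cue.
    rms = librosa.feature.rms(y=y)

    # MFCCs, the standard spectral features for emotion recognition.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)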
Code-mixed offensive content has pervaded social media posts in the last few years. Consequently, it has gained significant attention from the research community, which has worked on identifying the different forms of such content (e.g., hate speech and sentiment) and contributed to the creation of datasets. Most recent studies deal with high-resource languages (e.g., English) because of the many publicly available datasets, and owing to the lack of datasets in low-resource languages, those languages remain understudied. Therefore, this study focuses on offensive language identification in code-mixed low-resource Dravidian languages such as Tamil, Kannada, and Malayalam using a bidirectional approach and fine-tuning strategies. According to the leaderboard, the proposed model obtained F1-scores of 0.96 for Malayalam, 0.73 for Tamil, and 0.70 for Kannada on the benchmark. Moreover, among multilingual models, this model ranked 3rd, achieved favorable results, and was confirmed as the best among all systems submitted to these shared tasks in these three languages.
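Fine-tuning a bidirectional multilingual encoder for this task typically starts from a setup like the sketch below; XLM-R is used here as a stand-in, since the abstract does not name the exact checkpoint.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Stand-in checkpoint; the actual model used in the paper may differ.
    name = "xlm-roberta-base"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    # Score a code-mixed example (placeholder text) for offensiveness.
    inputs = tok("placeholder code-mixed sentence", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print("offensive" if logits.argmax(-1).item() == 1 else "not offensive")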
Privacy is going to be an integral part of data science and analytics in the coming years. The next wave of data experimentation will depend heavily on privacy-preserving techniques, mainly because privacy is becoming a legal responsibility rather than a mere social one. Privacy preservation is especially challenging in the context of unstructured data. Social networks have become predominantly popular over the past couple of decades, and they are creating a huge data lake at high velocity. Social media profiles contain a wealth of personal and sensitive information, creating enormous opportunities for third parties to analyze them with different algorithms, draw conclusions, and use the results in disinformation campaigns and microtargeting-based dark advertising. This study provides a mitigation mechanism for disinformation campaigns that are based on insights extracted from personal/sensitive data analysis. Specifically, this research aims to build a privacy-preserving data publishing middleware for unstructured social media data without compromising its true analytical value. A novel way is proposed to apply traditional structured privacy-preserving techniques to unstructured data. Creating a comprehensive Twitter corpus annotated with privacy attributes is another objective of this research, especially because the research community lacks one.
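One traditional structured-data technique, attribute suppression, carries over to unstructured text roughly as sketched below, here using spaCy NER to find and mask sensitive spans; the pipeline and label set are illustrative assumptions, not the middleware described above.

    import spacy

    # Small English pipeline as a stand-in entity detector.
    nlp = spacy.load("en_core_web_sm")

    def suppress_sensitive(text, sensitive_labels={"PERSON", "GPE", "ORG", "DATE"}):
        """Replace spans tagged with sensitive entity types by their type label."""
        doc = nlp(text)
        out, last = [], 0
        for ent in doc.ents:
            if ent.label_ in sensitive_labels:
                out.append(text[last:ent.start_char] + f"[{ent.label_}]")
                last = ent.end_char
        out.append(text[last:])
        return "".join(out)

    print(suppress_sensitive("Alice met Bob in Colombo on Monday."))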
Current state-of-the-art speech-based user interfaces use data-intensive methodologies to recognize free-form speech commands. However, this is not viable for low-resource languages, which lack speech data, restricting the usability of such interfaces to a limited number of languages. In this paper, we propose a methodology for developing a robust, domain-specific speech command classification system for low-resource languages using speech data from a high-resource language. In this transfer learning-based approach, we used a Convolutional Neural Network (CNN) to identify a fixed set of intents from an ASR-based character probability map. We achieved significant results on Sinhala and Tamil datasets using an English-based ASR, which attests to the robustness of the proposed approach.
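A minimal version of this idea, a CNN over the time-by-character probability map emitted by an ASR model, is sketched below in PyTorch; the layer sizes, character-set size, and intent count are assumptions.

    import torch
    import torch.nn as nn

    class CommandCNN(nn.Module):
        """Hypothetical sketch: 1-D CNN over an ASR character-probability map
        (time x characters) to classify a fixed set of intents."""
        def __init__(self, n_chars=28, n_intents=6):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(n_chars, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            self.fc = nn.Linear(64, n_intents)

        def forward(self, prob_map):             # (batch, time, n_chars)
            x = prob_map.transpose(1, 2)         # -> (batch, n_chars, time)
            return self.fc(self.conv(x).squeeze(-1))

    # Fake ASR output: batch of 2 utterances, 100 frames, 28 characters.
    probs = torch.softmax(torch.randn(2, 100, 28), dim=-1)
    print(CommandCNN()(probs).shape)             # torch.Size([2, 6])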
This paper describes an approach to implicit emotion classification that uses pre-trained word embedding models to train multiple neural networks. The system is composed of a sequential combination of a Long Short-Term Memory network and a Convolutional Neural Network for feature extraction, followed by a Feedforward Neural Network for classification. We show that features extracted using multiple pre-trained embeddings can improve the overall performance of the system, with emoji being one of the significant features. Evaluations show that our approach outperforms the baseline system by more than 8% without using any external corpus or lexicon. This approach ranked 8th in the Implicit Emotion Shared Task (IEST) at WASSA-2018.
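The described stack, an LSTM followed by a CNN for feature extraction and a feedforward network for classification, can be sketched in PyTorch as below; the hidden sizes are assumptions, and the input is assumed to be pre-trained word embeddings.

    import torch
    import torch.nn as nn

    class EmotionClassifier(nn.Module):
        """Hypothetical sketch of the described stack: LSTM -> CNN -> FFN."""
        def __init__(self, emb_dim=300, n_classes=6):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, 128, batch_first=True, bidirectional=True)
            self.conv = nn.Conv1d(256, 128, kernel_size=3, padding=1)
            self.pool = nn.AdaptiveMaxPool1d(1)
            self.ffn = nn.Sequential(
                nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_classes))

        def forward(self, emb):                  # (batch, seq, emb_dim)
            h, _ = self.lstm(emb)                # (batch, seq, 256)
            f = self.pool(torch.relu(self.conv(h.transpose(1, 2)))).squeeze(-1)
            return self.ffn(f)                   # (batch, n_classes)

    # Fake batch of 2 embedded sentences, 30 tokens each.
    emb = torch.randn(2, 30, 300)
    print(EmotionClassifier()(emb).shape)        # torch.Size([2, 6])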