Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Sobha Lalitha Devi, Karunesh Arora (Editors)


Anthology ID: 2024.icon-1
Month: December
Year: 2024
Address: AU-KBC Research Centre, Chennai, India
Venue: ICON
Publisher: NLP Association of India (NLPAI)
URL: https://preview.aclanthology.org/icon-24-ingestion/2024.icon-1/

Persuasion Games with Large Language Models
Shirish Karande | Santhosh V | Yash Bhatia

Large Language Models (LLMs) have emerged as formidable instruments capable of comprehending and producing human-like text. This paper explores the potential of LLMs to shape human perspectives and subsequently influence their decisions on particular tasks. This capability finds applications in diverse domains such as investment, credit cards, insurance, and retail, wherein these systems assist users in selecting appropriate insurance policies, investment plans, and credit cards, as well as in Behavioral Change Support Systems (BCSS). We present a sophisticated multi-agent framework wherein a consortium of agents operates in a collaborative manner. The primary agent engages directly with users through persuasive dialogue, while the auxiliary agents perform tasks such as information retrieval, response analysis, development of persuasion strategies, and validation of facts. Empirical evidence from our experiments demonstrates that this collaborative methodology significantly enhances the persuasive efficacy of the LLM. We analyze user resistance to persuasive efforts continuously and counteract it by employing a combination of rule-based and LLM-based resistance-persuasion mapping techniques. We employ simulated personas and generate conversations in the insurance, banking, and retail domains to evaluate the proficiency of large language models (LLMs) in recognizing, adjusting to, and influencing various personality types. Concurrently, we examine the resistance mechanisms employed by LLM-simulated personas. Persuasion is quantified via measurable surveys before and after interaction, LLM-generated scores on conversations, and user decisions (purchase or non-purchase).

MULTILATE: A Synthetic Dataset on AI-Generated MULTImodaL hATE Speech
Advaitha Vetagiri | Eisha Halder | Ayanangshu Das Majumder | Partha Pakray | Amitava Das

One of the pressing challenges society faces today is the rapid proliferation of online hate speech, exacerbated by the rise of AI-generated multimodal hate content. This new form of synthetically produced hate speech presents unprecedented challenges in detection and moderation. In response to the growing presence of such harmful content across social media platforms, this research introduces MULTILATE, a synthetic dataset of AI-generated multimodal hate speech.

Sumotosima: A Framework and Dataset for Classifying and Summarizing Otoscopic Images
Eram Anwarul Khan | Anas Anwarul Haq Khan

Otoscopy is a diagnostic procedure to examine the ear canal and eardrum using an otoscope. It identifies conditions like infections, foreign bodies, eardrum perforations, and ear abnormalities. We propose a novel resource-efficient deep learning and transformer-based framework, Sumotosima (Summarizer for Otoscopic Images), which provides an end-to-end pipeline for classification followed by summarization. Our framework utilizes a combination of triplet and cross-entropy losses. Additionally, we use Knowledge Enhanced Multimodal BART, whose input is fused textual and image embeddings. The objective is to deliver summaries that are well-suited for patients, ensuring clarity and efficiency in understanding otoscopic images. Given the lack of existing datasets, we have curated our own OCASD (Otoscopy Classification And Summary Dataset), which includes 500 images in 5 unique categories, annotated with their class and summaries by otolaryngologists. Sumotosima achieved a classification accuracy of 98.03%, which is 7.00%, 3.10%, and 3.01% higher than K-Nearest Neighbors, Random Forest, and Support Vector Machines, respectively. For summarization, Sumotosima outperformed GPT-4o and LLaVA by 88.53% and 107.57% in ROUGE scores, respectively. We have made our code and dataset publicly available at https://github.com/anas2908/Sumotosima
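As a concrete illustration of the loss combination this abstract describes, here is a minimal PyTorch sketch, assuming a model that exposes both an embedding (for metric learning) and class logits; the weighting factor alpha and the margin are illustrative assumptions, not the paper's settings.

```python
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)  # margin is an assumed value
ce = nn.CrossEntropyLoss()

def combined_loss(anchor_emb, pos_emb, neg_emb, logits, labels, alpha=0.5):
    # Metric-learning term pulls same-class otoscopic images together and
    # pushes different classes apart; the cross-entropy term trains the
    # classification head. alpha balances the two objectives (assumed 0.5).
    return alpha * triplet(anchor_emb, pos_emb, neg_emb) + (1 - alpha) * ce(logits, labels)
```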

Natural Answer Generation: From Factoid Answer to Full-length Answer using Grammar Correction
Manas Jain | Sriparna Saha | Pushpak Bhattacharyya | Gladvin Chinnadurai | Manish Vatsa

Question Answering systems these days typically use template-based language generation. Though adequate for a domain-specific task, these systems are too restrictive and predefined for domain-independent systems. This paper proposes a system that outputs a full-length answer given a question and the extracted factoid answer (short spans such as named entities) as the input. Our system uses constituency and dependency parse trees of questions. A transformer-based Grammar Error Correction model, GECToR, is used as a post-processing step for better fluency. We compare our system with (i) a Modified Pointer Generator (SOTA) and (ii) fine-tuned DialoGPT for factoid questions. We also tested our approach on existential (yes-no) questions with better results. Our model generates more accurate and fluent answers than the state-of-the-art (SOTA) approaches. The evaluation is done on the NewsQA and SQuAD datasets, with increments of 0.4 and 0.9 percentage points in ROUGE-1 score, respectively. Also, the inference time is reduced by 85% compared to the SOTA. The improved datasets used for our evaluation will be released as part of the research contribution.

Detecting AI-Generated Text with Pre-Trained Models Using Linguistic Features
Annepaka Yadagiri | Lavanya Shree | Suraiya Parween | Anushka Raj | Shreya Maurya | Partha Pakray

The advent of sophisticated large language models, such as ChatGPT and other AI-driven platforms, has led to the generation of text that closely mimics human writing, making it increasingly challenging to discern whether it is human-generated or AI-generated content. This poses significant challenges to content verification, academic integrity, and detecting misleading information. To address these issues, we developed a classification system to differentiate between human-written and AI-generated texts using the diverse HC3-English dataset. Our approach leveraged linguistic analysis and structural features, including part-of-speech tags, vocabulary size, word density, active and passive voice usage, and readability metrics such as Flesch Reading Ease, perplexity, and burstiness. We employed transformer-based and deep-learning models for the classification task, such as CNN_BiLSTM, RNN, BERT, GPT-2, and RoBERTa. Among these, the RoBERTa model demonstrated superior performance, achieving an impressive accuracy of 99.73%. These outcomes demonstrate how cutting-edge deep learning methods can maintain information integrity in the digital realm.
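For a sense of what such hand-crafted features look like, here is a hypothetical sketch computing a few of the listed statistics (vocabulary size, word density, and a variance-based proxy for burstiness); it is not the authors' feature extractor.

```python
import re
import statistics

def linguistic_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sent_lens = [len(s.split()) for s in sentences]
    return {
        "vocab_size": len(set(w.lower() for w in words)),
        "word_density": len(words) / max(len(sentences), 1),  # words per sentence
        # "Burstiness" approximated as variance in sentence length: human text
        # tends to vary more than LLM output. This is one common proxy.
        "burstiness": statistics.pvariance(sent_lens) if len(sent_lens) > 1 else 0.0,
    }
```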

Quality Estimation of Machine Translated Texts based on Direct Evidence Approach
Vibhuti Kumari | Narayana Murthy Kavi

The Quality Estimation task deals with estimating the quality of translations produced by a Machine Translation system without depending on reference translations. A number of approaches have been suggested over the years. In this paper, we show that the parallel corpus used as training data for the MT system holds direct clues for estimating the quality of the translations that system produces. Our experiments show that this simple, direct, and computationally efficient method holds promise for quality estimation of translations produced by any purely data-driven machine translation system.

Exploring User Dissatisfaction: Taxonomy of Implicit Negative Feedback in Virtual Assistants
Moushumi Mahato | Avinash Kumar | Kartikey Singh | Javaid Nabi | Debojyoti Saha | Krishna Singh

The success of virtual assistants relies on continuous performance monitoring to ensure their competitive edge in the market. This entails assessing their ability to understand user intents and execute tasks effectively. While user feedback is pivotal for measuring satisfaction levels, relying solely on explicit feedback proves impractical. Thus, extracting implicit user feedback from conversations between the user and the virtual assistant is a more efficient approach. Additionally, along with learning whether a task is performed correctly or not, it is extremely important to understand the reasons behind any incorrect execution. In this paper, we introduce a framework designed to identify dissatisfactory conversations, systematically analyze them, and generate comprehensive reports detailing the reasons for user dissatisfaction. By implementing a feedback classifier, we identify conversations that indicate user dissatisfaction, which serves as a sign of implicit negative feedback. To analyze negative-feedback conversations more deeply, we develop a lightweight pipeline called an issue categorizer, an ensemble of multiple models, to understand the reasons behind such dissatisfactory conversations. We subsequently augment the identified discontented instances to generate additional data and train our models to prevent such failures in the future. Our implementation of this simple framework, called AsTrix (Assisted Triage and Fix), led to significant enhancements in the performance of our smartphone-based in-house virtual assistant, with successful task completion rates increasing from 83.1% to 92.6% between June 2022 and March 2024. Moreover, by automating the deeper analysis process targeting just five major issue types contributing to the dissatisfaction, we address approximately 62% of the negative-feedback conversation data.

Vector Embedding Solution for Recommendation System
Vidya P V | Ajeesh Ramanujan

We propose a vector embedding approach for recommendation systems aimed at identifying product affinities and suggesting complementary items. By capturing relationships between products, the model delivers highly relevant recommendations based on the context. A neural network is trained on purchase data to generate word embeddings, represented as a weight matrix. The resulting model predicts complementary products with top-20 and top-50 precision scores of 0.59251 and 0.29556, respectively. These embeddings effectively identify products likely to be co-purchased, enhancing the relevance and accuracy of the recommendations.
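A common way to realize this idea is to treat each purchase basket as a "sentence" so that co-purchased products receive nearby embeddings; the sketch below uses gensim's Word2Vec with toy baskets and illustrative hyperparameters, as one plausible reading of the approach rather than the authors' code.

```python
from gensim.models import Word2Vec

# Toy purchase baskets; each basket plays the role of a sentence.
baskets = [
    ["milk", "bread", "butter"],
    ["laptop", "mouse", "laptop_bag"],
    ["milk", "cereal"],
]
model = Word2Vec(sentences=baskets, vector_size=64, window=5, min_count=1, sg=1)

# Recommend complementary items: nearest neighbours in embedding space.
print(model.wv.most_similar("milk", topn=5))
```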

Multi-document Summarization by Ensembling of Scoring and Topic Modeling Techniques
Rajendra Kumar Roul | Navpreet | Saif Nalband

With the growing volume of text, finding relevant information is increasingly difficult. Automatic Text Summarization (ATS) addresses this by efficiently extracting relevant content from large document collections. Despite progress, ATS faces challenges like managing long, repetitive sentences, preserving coherence, and maintaining semantic alignment. This work introduces an extractive summarization approach based on topic modeling to address these issues. The proposed method produces summaries with representative sentences, reduced redundancy, concise content, and strong semantic consistency. Its effectiveness, demonstrated through experiments on DUC datasets, outperforms state-of-the-art techniques.

Assessing Assamese Suffix Productivity: A Probabilistic Study in Resource-Limited Contexts
Pinky Moni Gayan | Arup K. Nath

Numerous digitally advanced global languages have been studied under the light of morphological productivity; however, Assamese and other Indo-Aryan languages remain understudied in this field, though it is a widely discussed area of morphology. The purpose of this paper is to demonstrate the productivity of 15 suffixes, estimated with several measuring methods on a manually prepared sample. The obtained values are then used to group the suffixes into different clusters, via clustering in R, based on their similar productivity rates. By deriving a general productivity rate for each suffix from the productivity rates of all the methods, the paper demonstrates how clustering in R may be used as an empirical and visual tool for grouping similarly productive suffixes. The paper also reports on the paucity of language resources and tools for the language, and how bridging this gap could yield more precise, seamless results in a notably shorter amount of time.

Identification of Idiomatic Expressions in Konkani Language Using Neural Networks
Naziya Mahamdul Shaikh | Jyoti Pawar

The task of multi-word expression identification and processing has posed a remarkable challenge to natural language processing applications. One related subtask in this arena is the correct labelling of sentences containing idiomatic expressions as carrying either a literal or an idiomatic sense. The regional Indian language Konkani, spoken in the states along the west coast of India, lacks research on idiom processing tasks. We aim to bridge this gap through a contribution to idiom identification in the Konkani language. This paper classifies idiomatic expression usage in Konkani as idiomatic or literal using a neural network-based setup. The developed system performed the identification task with an accuracy of 79.5% and an F1-score of 0.77.

TRO(F)LL or ROFL?: Exploring Troll Detection in Tamil Memes
Aditya Krishna Ponnudurai | Swetha J | Rajalakshmi Sivanaiah

The advent of social networks has deeply improved and enhanced the ways in which people communicate. However, along with the positives, there are negatives as well. The rapid dissemination of information via various means, be it tweets, WhatsApp forwards or memes, has led to widespread misinformation and online abuse. The increasing prevalence of misinformation and online abuse makes the automatic detection of trolling content in Tamil memes an important task.

Empowering SW Security: CodeBERT and Machine Learning Approaches to Vulnerability Detection
Lov Kumar | Vikram Singh | Srivalli Patel | Pratyush Mishra

Software (SW) systems experience faults after deployment, raising concerns about reliability and leading to financial losses, reputational damage, and safety risks. This paper presents a novel approach that uses CodeBERT, a state-of-the-art neural code representation model pre-trained on multiple programming languages, together with various code metrics to predict SW faults. The study comprehensively evaluates the trained models by analyzing a publicly available codebase and employing diverse machine learning models, feature selection techniques, and class balancing through SMOTE. The results show that SMOTE significantly enhances vulnerability detection performance, particularly in accuracy, AUC, sensitivity, and specificity. The EXTR classifier consistently outperforms others, with an average AUC of 0.82, while features selected using the GA feature selection technique achieve a mean AUC of 0.84. Interestingly, among the employed embedding techniques, SW metrics combined with CodeBERT (SMCBERT) stand out as top performers, achieving the highest mean AUC score of 0.80, making models trained on SMCBERT the best for SW vulnerability prediction.

Exploring Expected Answer Types for Effective Question Answering Systems for low resource language
Chindukuri Mallikarjuna | Sangeetha Sivanesan

Question-answering (QA) systems play a pivotal role in natural language processing (NLP), powering applications such as search engines and virtual assistants by providing accurate responses to user queries. However, building effective QA systems for Dravidian languages, like Tamil, poses distinct challenges due to the scarcity of resources and the linguistic complexities inherent to these languages. This paper introduces a novel method to enhance QA accuracy by integrating answer-type features alongside traditional question and context inputs. We fine-tuned both mono- and multilingual pre-trained models on the Extended Chaii dataset, which comprises Tamil translations from the SQuAD dataset, as well as on the SQuAD-EAT-5000 dataset, consisting of English-language instances. Our experiments reveal that incorporating answer-type features significantly improves model performance compared to using only question and context inputs. Specifically, for the Extended Chaii dataset, the MuRIL model achieved the highest F1 score of 53.89, surpassing other pre-trained models, while RoBERTa outperformed BERT on the SQuAD-EAT-5000 dataset with a score of 82.07. This research advances QA systems for Dravidian languages and underscores the importance of integrating linguistic features for improved accuracy.
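One plausible reading of the answer-type injection is to prepend the expected-answer-type tag to the question before encoding the (question, context) pair; the tag format and checkpoint below are assumptions for illustration, not necessarily the authors' exact scheme.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

question = "When was the battle fought?"
context = "...context passage..."
ans_type = "DATE"  # expected answer type; tag vocabulary is an assumption

# Fuse the answer-type tag into the question, then encode the pair.
enc = tok(f"[{ans_type}] {question}", context, truncation=True, return_tensors="pt")
print(tok.decode(enc["input_ids"][0])[:120])  # sanity-check the fused input
```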

A self-supervised domain-independent Named Entity Recognition using local similarity
Keerthi S. A. Vasan | Uma Satya Ranjan

Out-of-vocabulary words can be challenging for NER systems. We introduce a self-supervised system for Named Entity Recognition based on a few-shot annotated examples provided by experts. Subsequently, the rest of the words are tagged using the closest similarity match between the word embeddings of each category, generated in the same context as the original annotations. Additionally, we use a dual-threshold scheme to improve the robustness of the method. Our results show that this method outperforms current state-of-the-art methods in both accuracy and generalisation.
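A sketch of what closest-similarity tagging with a dual threshold could look like: a label is accepted only if the best match is strong and clearly beats the runner-up. The thresholds and prototype embeddings are assumptions, not the paper's values.

```python
import numpy as np

def tag_word(word_vec, category_vecs, accept=0.7, margin=0.1):
    # category_vecs: {label: prototype embedding built from expert-annotated
    # few-shot examples in the same context}. Thresholds are illustrative.
    sims = {c: float(np.dot(word_vec, v) / (np.linalg.norm(word_vec) * np.linalg.norm(v)))
            for c, v in category_vecs.items()}
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else (None, -1.0)
    # Dual threshold: absolute confidence AND a clear gap to the runner-up.
    if best[1] >= accept and best[1] - second[1] >= margin:
        return best[0]
    return "O"  # leave untagged when the evidence is ambiguous
```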

Precision Empowers, Excess Distracts: Visual Question Answering With Dynamically Infused Knowledge In Language Models
Manas Jhalani | Annervaz K M | Pushpak Bhattacharyya

In the realm of multimodal tasks, Visual Question Answering (VQA) plays a crucial role by addressing natural language questions grounded in visual content. Knowledge-Based Visual Question Answering (KBVQA) advances this concept by adding external knowledge along with images to respond to questions. We introduce an approach for KBVQA, augmenting the existing vision-language transformer encoder-decoder (OFA) model. Our main contribution involves enhancing questions by incorporating relevant external knowledge extracted from knowledge graphs, using a dynamic triple extraction mechanism that supplies only the knowledge relevant to each question.

We Care: Multimodal Depression Detection and Knowledge Infused Mental Health Therapeutic Response Generation
Palash Moon | Pushpak Bhattacharyya

The detection of depression through non-verbal cues has gained significant attention. Previous research predominantly centred on identifying depression within the confines of controlled laboratory environments, often with the supervision of psychologists or counsellors. Unfortunately, datasets generated in such controlled settings may struggle to account for individual behaviours in real-life situations. In response to this limitation, we present the Extended D-vlog dataset, encompassing a collection of 1,261 YouTube vlogs. Additionally, the emergence of large language models (LLMs) like GPT-3.5 and GPT-4 has sparked interest in their potential to act like mental health professionals. Yet the readiness of these LLMs for use in real-life settings is still a concern, as they can give wrong responses that can harm users. We introduce a virtual agent serving as an initial contact for mental health patients, offering Cognitive Behavioral Therapy (CBT)-based responses. It comprises two core functions: 1. identifying depression in individuals, and 2. delivering CBT-based therapeutic responses. Our Mistral model achieved impressive scores of 70.1% and 30.9% for distortion assessment and classification, along with a BERTScore of 88.7%. Moreover, utilizing the TVLT model on our Multimodal Extended D-vlog Dataset yielded outstanding results, with an impressive F1-score of 67.8%.

Aiding Non-Verbal Communication: A Bidirectional Language Agnostic Framework for Automating Text to AAC Generation
Piyali Karmakar | Manjira Sinha

Persons with severe speech and motor impairments (SSMI), like those with cerebral palsy (CP), face significant challenges communicating through conventional methods. They often rely on graphical symbol-based Augmentative and Alternative Communication (AAC) systems to facilitate communication. Our work aims to support AAC communication by developing specialized datasets for direct translation of graphical symbols to natural language text. The dataset is enhanced with an automated Text-to-Pictogram generation module and enriched with additional information such as tense markers and subjective constructions (questions, exclamations). Additionally, we expanded our efforts to include translation into the Indian language Bengali, for those individuals with SSMI who are more comfortable communicating in their native language. We aim to develop an end-to-end language-agnostic framework for efficient bidirectional communication between non-verbal AAC picture symbols and textual data.

Mocktails of Translation, Ensemble Learning and Embeddings to tackle Hinglish NLP challenges
Lov Kumar | Vikram Singh | Proksh | Pratyush Mishra

Social media has become a global platform where users express opinions on diverse contemporary topics, often blending dominant languages with native tongues, leading to code-mixed, context-rich content. A typical example is Hinglish, where Hindi elements are embedded in English texts. This linguistic mixture challenges traditional NLP systems, which rely on monolingual resources and struggle to process multilingual content. Sentiment analysis for code-mixed data, particularly involving Indian languages, remains largely unexplored. This paper introduces a novel approach for sentiment analysis of code-mixed Hinglish data, combining translation, different stacking-classifier architectures, and embedding techniques. We utilize pre-trained LoRA weights of a fine-tuned Gemma-2B model to translate Hinglish into English, followed by the employment of four pre-trained meta-embeddings: GloVe-T, Word2Vec, TF-IDF, and fastText. SMOTE is applied to balance skewed data, and dimensionality reduction is performed before implementing machine learning models and stacking-classifier ensembles. Three ensemble architectures, combining 22 base classifiers with a Logistic Regression meta-classifier, test different meta-embedding combinations. Experimental results show that the TF-W2V-FST (TF-IDF, Word2Vec, fastText) combination performs best, with an SVM using a radial basis kernel achieving the highest accuracy (91.53%) and AUC (0.96). This research contributes a novel and effective technique for sentiment analysis of code-mixed data.
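A reduced stand-in for the stacking setup (two base classifiers instead of the paper's 22, with a Logistic Regression meta-classifier and SMOTE balancing), sketched with scikit-learn and imbalanced-learn; hyperparameters are illustrative assumptions.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def build_pipeline(X_train, y_train):
    # Oversample minority classes before training, as described in the text.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
    stack = StackingClassifier(
        estimators=[
            ("svm_rbf", SVC(kernel="rbf", probability=True)),
            ("rf", RandomForestClassifier(n_estimators=200)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    return stack.fit(X_bal, y_bal)
```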

Towards Understanding the Robustness of LLM-based Evaluations under Perturbations
Manav Chaudhary | Harshit Gupta | Savita Bhat | Vasudeva Varma

Traditional evaluation metrics like BLEU and ROUGE fall short when capturing the nuanced qualities of generated text, particularly when there is no single ground truth. In this paper, we explore the potential of Large Language Models (LLMs), specifically Google Gemini, to serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality annotators when compared with human judgments on the SummEval and USR datasets, asking the model to generate both a score and a justification for the score. Furthermore, we explore the robustness of the LLM evaluator by using perturbed inputs. Our findings suggest that while LLMs show promise, their alignment with human evaluators is limited; they are not robust against perturbations, and significant improvements are required before they can be used standalone as reliable evaluators for subjective metrics.

Improving on the Limitations of the ASR Model in Low-Resourced Environments Using Parameter-Efficient Fine-Tuning
Rupak Raj Ghimire | Prakash Poudyal | Bal Krishna Bal

Modern general-purpose speech recognition systems are more robust for languages with high resources. In contrast, achieving state-of-the-art accuracy for low-resource languages is still challenging. Fine-tuning a pre-trained model is one highly popular practice that utilizes existing information while efficiently learning from a small amount of data to enhance the precision and robustness of speech recognition tasks. This work attempts to diagnose the performance of a pre-trained model when transcribing audio from a low-resource language. We apply an adapter-based iterative parameter-efficient fine-tuning strategy on a limited dataset, aiming to improve the quality of transcription of a previously fine-tuned model. For the experiments we used Whisper's multilingual pre-trained speech model with Nepali as the test language. Using this approach we achieved a Word Error Rate of 27.9%, a more than 19% improvement over the pre-trained Whisper Large-V2.
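A rough sketch of adapter-style parameter-efficient fine-tuning on Whisper using the peft library; the LoRA rank, target modules, and other settings are assumptions, not the paper's exact configuration.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Attach small trainable adapter matrices to the attention projections;
# r, alpha, and target_modules below are illustrative choices.
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(base, config)

model.print_trainable_parameters()  # only the adapters are updated during fine-tuning
```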

Towards Enhancing Knowledge Accessibility for Low-Resource Indian Languages: A Template Based Approach
Srijith Padakanti | Akhilesh Aravapalli | Abhijith Chelpuri | Radhika Mamidi

In today’s digital age, access to knowledge and information is crucial for societal growth. Although widespread resources like Wikipedia exist, there is still a linguistic barrier to break down for low-resource languages. In India, millions of individuals still lack access to reliable information from Wikipedia because they are proficient only in their regional language. To address this gap, our work focuses on enhancing the content and digital footprint of multiple Indian languages. The primary objective of our work is to improve knowledge accessibility by generating a substantial volume of high-quality Wikipedia articles in Telugu, a widely spoken language in India with around 95.7 million native speakers. Our work creates Wikipedia articles while ensuring that each article meets necessary quality standards, such as a minimum word count, inclusion of images for reference, and an infobox, and adheres to the five core principles of Wikipedia. We streamline our article-generation process, leveraging NLP techniques such as translation, transliteration, and template generation, and incorporating human intervention when necessary. Our contribution is a collection of 8,929 articles in the movie domain, now ready to be published on Telugu Wikipedia.

Monolingual text summarization for Indic Languages using LLMs
Jothir Adithya T K | Nithish Kumar S | Felicia Lilian J | Mahalakshmi S

We analyze advanced text summarization methods leveraging LLMs for Indic languages. Text summarization involves transforming a longer text into a more concise version, ensuring that the most prominent information and key meanings are maintained. Our goal is to produce concise and accurate summaries from longer texts, focusing on retaining detailed information and coherence. We utilize NLP techniques for text cleaning, keyword extraction, and summarization, along with performance evaluation metrics such as ROUGE score, BLEU score, and BERTScore. The results demonstrate an incremental improvement in the quality of generated summaries, with a particular emphasis on enhancing informativeness while minimizing redundancy. This work also highlights the importance of tuning parameters and leveraging advanced models for producing high-quality summaries in diverse domains for Indic languages.
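For reference, the ROUGE evaluation step might look like the following sketch with the rouge_score package (stemming disabled, a reasonable default for Indic text); the strings are placeholders, not data from the paper.

```python
from rouge_score import rouge_scorer

# English stemming would not help Indic-language summaries, so leave it off.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)
scores = scorer.score("reference summary text", "generated summary text")
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```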

Profanity and Offensiveness Detection in Nepali Language Using Bi-directional LSTM Models
Abiral Adhikari | Prashant Manandhar | Reewaj Khanal | Samir Wagle | Praveen Acharya | Bal Krishna Bal

Offensive and profane content has been on the rise in Nepali social media, which is very disturbing to users. This is partly due to the absence of proper tools and mechanisms for the Nepali language to deal with profanity and offensive texts. In this work, we attempt to develop a deep learning-based profanity and offensive-comment detection tool. We develop a Bi-LSTM (Bidirectional Long Short-Term Memory) based model for the classification of profane and offensive comments and study different variations of the task. Furthermore, Multilingual BERT embeddings and vocabulary embeddings, among others, were used for an accurate understanding of the intent and decency of the posts. While previous related studies in the Nepali language have focused on sentiment and offensiveness detection only, our study explores profanity and offensiveness detection as two distinct tasks. Our Bi-LSTM model achieves an accuracy of 87.8%.

PollCardioKG: A Dynamic Knowledge Graph of Interaction Between Pollution and Cardiovascular Diseases
Sudeshna Jana | Anunak Roy | Manjira Sinha | Tirthankar Dasgupta

In recent decades, environmental pollution has become a pressing global health concern. According to the World Health Organization (WHO), a significant portion of the population is exposed to air pollutant levels exceeding safety guidelines. Cardiovascular diseases (CVDs) — including coronary artery disease, heart attacks, and strokes — are particularly significant health effects of this exposure. In this paper, we investigate the effects of air pollution on cardiovascular health by constructing a dynamic knowledge graph based on extensive biomedical literature. This paper provides a comprehensive exploration of entity identification and relation extraction, leveraging advanced language models. Additionally, we demonstrate how in-context learning with large language models can enhance the accuracy and efficiency of the extraction process. The constructed knowledge graph enables us to analyze the relationships between pollutants and cardiovascular diseases over the years, providing deeper insights into the long-term impact of cumulative exposure, underlying causal mechanisms, vulnerable populations, and the role of emerging contaminants in worsening various cardiac outcomes.

Comprehensive Plagiarism Detection in Malayalam Texts Through Web and Database Integration
Meharuniza Nazeem | Parvathy Raj | Rajeev R. R | Anitha R | Navaneeth S

Plagiarism detection techniques have become essential for recognizing instances of plagiarism, particularly in academics, where scientific papers and documents are of prime importance. We propose an application that offers a comprehensive solution for detecting plagiarism in scholarly articles written in Malayalam, enabling users to submit texts, analyze them for plagiarism, and review the results interactively. With the increasing accessibility of digital content, maintaining originality in academic writing has become more difficult. Our research addresses this challenge by providing a solution tailored to the Malayalam language. The application aids researchers and academic institutions in detecting potential plagiarism through access to web-based content and algorithmic text analysis. The study contributes significantly to plagiarism detection for low-resource languages such as Malayalam and offers a practical way to preserve the originality of Malayalam scholarly work. The performance of four algorithms, SequenceMatcher, N-Grams, Rabin-Karp, and Cosine Similarity, is thoroughly evaluated. Cosine Similarity, with a 92.45% detection rate, outperformed the others, significantly surpassing Rabin-Karp (65.3%), N-Grams (58.7%), and SequenceMatcher (51.4%). Building on this improved efficiency, a user-friendly web application was developed that integrates web search and database comparison features with the Cosine Similarity algorithm.
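The best-performing detector above reduces to a similarity score between a submitted text and a candidate source; a minimal sketch using TF-IDF vectors and cosine similarity is shown below, as an illustration rather than the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_score(submitted: str, source: str) -> float:
    # Vectorize both documents in a shared TF-IDF space, then compare them.
    tfidf = TfidfVectorizer().fit_transform([submitted, source])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])  # 1.0 = identical
```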

Enhancing Trust and Interpretability in Malayalam Sentiment Analysis with Explainable AI
Meharuniza Nazeem | Anitha R | Navaneeth S | Rajeev R. R

Natural language processing (NLP) has seen a rise in the use of explainable AI, especially for low-resource languages like Malayalam. This study builds on our earlier research on sentiment analysis, which uses identified views to classify and understand context. We used two machine learning approaches, Support Vector Machine (SVM) and Random Forest (RF) classifiers, to perform sentiment analysis on the Kerala political opinion corpus. Using Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) features, we construct feature vectors for sentiment analysis. Our analysis of the Random Forest classifier's performance shows that it outperforms the SVM in terms of accuracy and efficiency, with an accuracy of 85.07%. Using Local Interpretable Model-Agnostic Explanations (LIME) as a foundation, we address the interpretability of the text classification and sentiment analysis models. This integration increases user confidence and model adoption by offering concise and understandable justifications for model predictions. The study lays the groundwork for future developments in the area by demonstrating the significance of explainable AI in NLP for low-resource languages.
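A small end-to-end sketch of explaining one prediction with LIME, using a toy TF-IDF + Random Forest pipeline as a stand-in for the actual Malayalam classifier; the texts and class names are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_text import LimeTextExplainer

# Toy corpus standing in for the Kerala political opinion data.
texts = ["good decision by government", "terrible policy failure",
         "great initiative", "bad move"]
labels = [1, 0, 1, 0]
clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier()).fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
exp = explainer.explain_instance("good policy but bad execution",
                                 clf.predict_proba, num_features=4)
print(exp.as_list())  # (word, weight) pairs a user can inspect
```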

Open-Source OCR Libraries: A Comprehensive Study for Low Resource Language
Meharuniza Nazeem | Anitha R | Navaneeth S | Rajeev R. R

This paper reviews numerous OCR programs and libraries employed for optical character recognition tasks. Tesseract OCR, an open-source program that supports multiple languages and image formats, is highlighted for its accuracy and adaptability. Python-based libraries like EasyOCR, MMOCR, and PaddleOCR are also covered; these provide user-friendly interfaces and trained models for text extraction, detection, and recognition. EasyOCR emphasizes ease of use and simplicity, while MMOCR and PaddleOCR offer comprehensive OCR capabilities and support for a wide range of languages. According to our study, which evaluates various OCR libraries, Tesseract OCR performs remarkably well in terms of accuracy for Indian languages like Malayalam. We focused on five OCR libraries (Tesseract OCR, MMOCR, PaddleOCR, EasyOCR, and Keras OCR) and tested them across several languages, including English, Hindi, Arabic, Tamil, and Malayalam. During our comparison, we found that Tesseract OCR was the only library that supported the Malayalam language. While the other libraries did not support Malayalam, Tesseract OCR performed well across all tested languages, achieving accuracy rates of 92% in English, 93% in Hindi, 78% in Tamil, 74% in Arabic, and 93% in Malayalam.
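A hypothetical usage sketch of Tesseract for Malayalam via pytesseract (the only library in this comparison with Malayalam support); it assumes the Tesseract binary and its 'mal' language data are installed, and the file name is a placeholder.

```python
from PIL import Image
import pytesseract

# 'mal' is Tesseract's code for Malayalam; requires the mal traineddata file.
text = pytesseract.image_to_string(Image.open("scanned_page.png"), lang="mal")
print(text)
```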

Value to User’s Voice: A Generative AI Framework for Actionable Insights from Customer Reviews in Consumer Electronics
Radhika Mundra | Bhavesh Kukreja | Aritra Ghosh Dastidar | Kartikey Singh | Javaid Nabi

Customer reviews are a valuable asset for businesses, especially in the competitive consumer electronics sector, where understanding user preferences and product performance is critical. However, extracting meaningful insights from these unstructured and often noisy reviews is a challenging task that typically requires significant manual effort. We present a generative AI framework that turns such customer reviews into actionable insights for the consumer electronics domain.

Exploring Kolmogorov Arnold Networks for Interpretable Mental Health Detection and Classification from Social Media Text
Ajay Surya Jampana | Mohitha Velagapudi | Neethu Mohan | Sachin Kumar S

Mental health analysis from social media text demands both high accuracy and interpretability for responsible healthcare applications. This paper explores Kolmogorov Arnold Networks (KANs) for mental health detection and classification, demonstrating their superior performance compared to Multi-Layer Perceptrons (MLPs) in accuracy while requiring fewer parameters. To further enhance interpretability, we leverage the Local Interpretable Model Agnostic Explanations (LIME) method to identify key features, resulting in a simplified KAN model. This allows us to derive governing equations for each class, providing a deeper understanding of the relationships between texts and mental health conditions.

Automatic Sanskrit Poetry Classification Based on Kāvyaguṇa
Amruta Barbadikar | Amba Kulkarni

Kāvyaguṇa denotes the syntactic and phonetic attributes or qualities of Sanskrit poetry that enhance its artistic appeal, commonly classified into three categories: Mādhyurya (Sweetness), Oja (Floridity), and Prasāda (Lucidity). This paper presents the Kāvyaguṇa Classifier, a machine learning module designed to classify Sanskrit literary texts into the three guṇas, employing a diverse range of machine learning algorithms, including Random Forest, Gradient Boosting, XGBoost, Multi-Layer Perceptron, and Support Vector Machine. For vectorization, we employed two methods: the neural network-based Word2vec and a custom feature engineering approach grounded in the theoretical understanding of Kāvyaguṇas as described in Sanskrit poetics. The feature engineering model significantly outperformed, achieving an accuracy of up to 90.6%.

Landscape Painter: Mimicking Human Like Art Using Generative Adversarial Networks
Yash Gogoriya | Oswald C | Abhijith Balan

Generating paintings using AI has been an intriguing area of research and has posed significant challenges in recent years. Landscape painting is a type of man-made ecological art form which contributes to preserving the ecological integrity of the environment we live in. Generative AI-based painting constitutes a form of visual expression encompassing various elements like drawing, arrangement, and conceptualization. Existing generative models do not replicate the painting process followed by a human painter, who creates artwork in stages: sketching, outlining, and colouring. Current generative models also restrict the range and diversity of styles by depending solely on carefully curated datasets such as WikiArt and VanGogh. The proposed work uses scraping techniques to collect a comprehensive and diverse set of landscape paintings. The primary objective of this research is to apply various generative AI models to produce artwork that replicates the human painting process and encompasses a variety of artistic themes and styles rather than relying on a particular one. Our results show that separating landscape painting generation into distinct sketch and colour phases is effective, producing engaging and realistic paintings.

Towards Efficient Audio-Text Keyword Spotting: Quantization and Multi-Scale Linear Attention with Foundation Models
Rahothvarman P | Radhika Mamidi

Open Vocabulary Keyword Spotting is essential in numerous applications, from virtual assistants to security systems, as it allows systems to identify specific words or phrases in continuous speech. In this paper, we propose a novel end-to-end method for detecting user-defined open vocabulary keywords by leveraging linguistic patterns for the correlation between audio and text modalities. Our approach utilizes quantized pre-trained foundation models for robust audio embeddings and a unique lightweight Multi-Scale Linear Attention (MSLA) network that aligns speech and text representations for effective cross-modal agreement. We evaluate our method on two distinct datasets, comparing its performance against other baselines. The results highlight the effectiveness of our approach, achieving significant improvements over the Cross-Modality Correspondence Detector (CMCD) method, with a 16.08% increase in AUC and a 17.2% reduction in EER metrics on the Google Speech Commands dataset. These findings demonstrate the potential of our method to advance keyword spotting across various real-world applications.

Shabdocchar: Konkani WordNet Enrichment with Audio Feature
Sunayana R. Gawde | Shrikrishna R. Parab | Jayram Ulhas Gawas | Shilpa Neenad Desai | Jyoti Pawar

Konkani WordNet, also called Konkani Shabdamalem, was created as part of the Indradhanush WordNet Project Consortium between August 2010 and October 2013. Currently, the Konkani WordNet includes about 32,370 synsets and 37,719 unique words. There is a need to enhance the Konkani WordNet both quantitatively and qualitatively. In this paper, we present a game-based crowdsourcing approach for adding an audio feature to the Konkani WordNet. This has increased the number of users using, and being exposed to, the capabilities of the Konkani WordNet, aiding the Konkani language teaching-learning process as well as the creation of resources to initiate further research. The work presented here has resulted in an audio corpus of 37,719 unique words, which we have named 'Shabdocchar', created within a short span of four months and covering five dialects of Konkani. We are confident that Shabdocchar will prove to be a very useful resource to support future research on the dialects of Konkani and to support voice-based search of words in the wordnet. This approach can be adopted to enhance other wordnets as well.

Konkani Wordnet Visualizer as a Concept Teaching-Learning Tool
Sunayana R. Gawde | Jayram Ulhas Gawas | Shrikrishna R. Parab | Shilpa Neenad Desai | Jyoti Pawar

The Visualizer is a tree-structured tool designed for browsing and exploring the Konkani WordNet lexical database. We propose to utilise this tool as a concept teaching and learning resource for Konkani, to be used by both teachers and students. It can also be used to add missing semantic and lexical relations, thus enhancing the wordnet. It extracts related concepts for a given word and displays them as a sub-tree. The interface includes various features to offer users greater flexibility in navigating and understanding word relationships. We have attempted to enrich the Konkani WordNet qualitatively with a Visualizer that offers improved usability and is incorporated into the Konkani WordNet website for public use. The Visualizer is designed to provide graphical representations of words and their semantic relationships, making it easier to explore connections and meanings within the lexical database.

A Comparative Assessment of Machine Learning Techniques in Kannada Multi-Emotion Sentiment Analysis
Dakshayani Ijeri | Pushpa B. Patil

To advance a firm, it is crucial to understand user opinions on social media. India is linguistically diverse, with Kannada being one of its widely spoken languages. Sentiment analysis in Kannada offers a tool to assess opinion, gather customer feedback, and identify social media trends among the Kannada-speaking community. This kind of analysis assists businesses in comprehending the sentiments expressed in Kannada-language customer reviews, social media posts, and online conversations. It empowers them to make data-driven choices and customize their offerings to better suit the needs of their customers. This work proposes a model to perform sentiment analysis in the Kannada language with four emotions, namely anger, fear, joy, and sadness, using machine learning algorithms like linear support vector classification, logistic regression, stochastic gradient descent, K-nearest neighbors, multinomial naive Bayes, and random forest classification. The model achieved an accuracy of 87.25% with a linear support vector classifier.

Human vs Machine: An Automated Machine-Generated Text Detection Approach
Urwah Jawaid | Rudra Roy | Pritam Pal | Srijani Debnath | Dipankar Das | Sivaji Bandyopadhyay

With the advancement of natural language processing (NLP) and sophisticated Large Language Models (LLMs), distinguishing between human-written and machine-generated texts is quite difficult nowadays. This paper presents a systematic approach to classifying machine-generated versus human-written text with a combination of a transformer-based model and a textual-feature-based post-processing technique. We extracted five textual features (readability score, stop-word score, spelling and grammatical error count, unique-word score, and human-phrase count) from both human-written and machine-generated texts separately and trained three machine learning models (SVM, Random Forest, and XGBoost) with these scores. Along with exploring traditional machine-learning models, we explored BiLSTM and transformer-based DistilBERT models to enhance classification performance. Training and evaluating on a large dataset containing both human-written and machine-generated text, our best-performing framework achieves an accuracy of 87.5%.

SansGPT: Advancing Generative Pre-Training in Sanskrit
Rhugved Pankaj Chaudhari | Bhakti Jadhav | Pushpak Bhattacharyya | Malhar Kulkarni

In the past decade, significant progress has been made in digitizing Sanskrit texts and advancing computational analysis of the language. However, efforts to advance NLP for complex semantic downstream tasks like Semantic Analogy Prediction, Named Entity Recognition, and others remain limited. This gap is mainly due to the absence of a robust, pre-trained Sanskrit model built on large-scale Sanskrit text data since this demands considerable computational resources and data preparation. In this paper, we introduce SansGPT, a generative pre-trained model that has been trained on a large corpus of Sanskrit texts and is designed to facilitate fine-tuning and development for downstream NLP tasks. We aim for this model to serve as a catalyst for advancing NLP research in Sanskrit. Additionally, we developed a custom tokenizer specifically optimized for Sanskrit text, enabling effective tokenization of compound words and making it better suited for generative tasks. Our data collection and cleaning process encompassed a wide array of available Sanskrit literature, ensuring comprehensive representation for training. We further demonstrate the model’s efficacy by fine-tuning it on Semantic Analogy Prediction and Simile Element Extraction, achieving an impressive accuracy of approximately 95.8% and 92.8%, respectively.

LOC: Livestock Ontology Construction Approach From Domain based Text Documents
Nandhana Prakash | Amudhan A | Nithish R | Krithikha Sanju Saravanan

Livestock plays an irreplaceable role in rural and global economies, and a livestock ontology would unlock its potential for cross-domain applications of Natural Language Processing (NLP). Domain data is essential for retrieving semantic and syntactic understanding of the input text given to a model. This paper proposes a Livestock-based Ontology Construction (LOC) approach. The input data undergoes anaphora resolution employing rule-based semantic methods; a pre-trained BERT model combined with regular expressions is then utilized to retrieve terms (entities) from the data. A Graph Neural Network (GNN), again combined with regular expressions, extracts relationships from the input documents for designing the livestock ontology. The proposed LOC approach, based on the BERT model with regular expressions and the GNN method with regular expressions, shows noteworthy results compared to existing methods, with a precision of 97.56% and a recall of 95.24%.

A Systematic Exploration of Linguistic Phenomena in Spoken Hindi: Resource Creation and Hypothesis Testing
Aadya Ranjan | Sidharth Ranjan | Rajakrishnan Rajkumar

This paper presents a meticulous and well-structured approach to annotating a corpus of spoken Hindi data. We deployed four annotators to augment the spoken section of the EMILLE Hindi corpus by marking the various linguistic phenomena observed in spoken data. We then analyzed various phonological (sound deletion), morphological (code-mixing and reduplication), and syntactic phenomena (case markers and ambiguity) not attested in written data. Code mixing and switching constitute the majority of the phenomena we annotated, followed by orthographic errors related to symbols in the Devanagari script. In terms of divergences from the written form of Hindi, case marker usage, missing auxiliary verbs, and agreement patterns are markedly distinct in spoken Hindi. The annotators also assigned a quality rating to each sentence in the corpus. Our analysis of the quality ratings revealed that most of the sentences in the spoken-data corpus are of moderate to high quality. Female speakers produced a greater percentage of high-quality sentences than their male counterparts. While previous efforts in corpus annotation have been largely focused on creating resources for engineering applications, we illustrate the utility of our dataset for scientific hypothesis testing. Inspired by the Surprisal Theory of language comprehension, we validate the hypothesis that sentences with high values of lexical surprisal are rated low in terms of quality by native speakers, even when controlling for sentence length and word frequencies in a sentence.

Analytics Graph Query Solver (AGQS): Transforming Natural Language Queries into Actionable Insights
Debojyoti Saha | Krishna Singh | Moushumi Mahato | Javaid Nabi

In today’s era, data analytics is crucial because it allows organizations to make informed decisions based on the analysis of large amounts of data. The evolving landscape of data analytics presents a growing challenge in effectively translating natural language queries into actionable insights. To address this challenge, we introduce a novel system that seamlessly integrates natural language processing (NLP), graph-based feature representation, and code generation. Our method, called Analytics Graph Query Solver (AGQS), utilizes large language models (LLMs) to construct a dynamic graph representing keywords and engineered features. AGQS transforms textual input queries into structured descriptions and generates corresponding plans. These plans are executed stepwise to create a unified code, which is subsequently applied to our in-house virtual assistant dataset to fulfill the user's query. Furthermore, a robust verification module ensures the reliability of the obtained results. Through experimentation, our system achieved an accuracy of 62.2%, outperforming models like GPT-4 (50.2%), Graph Reader (56.6%), Mistral 7B (38.6%), and Llama 7B (37.6%). Overall, our approach highlights the importance of feature generation in textual query resolution and demonstrates notable improvements in accessibility and precision for data analytics. With this method, we aim to present a solution for converting natural language queries into actionable steps, ultimately generating code that provides data insights. This approach can be utilized across different datasets, empowering developers and researchers to gain valuable insights effortlessly.

Automatic Summarization of Long Documents
Naman Chhibbar | Jugal Kalita

A vast amount of textual data is added to the internet daily, making utilization and interpretation of such data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious reading time. Although many transformer-based models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.
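One simple way to work around a context limit, in the spirit of (though not identical to) the paper's algorithms, is a two-pass map-reduce: summarize overlapping chunks, then summarize the concatenated partial summaries. `summarize` below is a hypothetical wrapper around any LLM call.

```python
def chunk(words, size=3000, overlap=200):
    # Overlap keeps sentences that straddle a boundary visible to both chunks.
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def summarize_long(text, summarize):
    partials = [summarize(c) for c in chunk(text.split())]
    # Second pass fuses the partial summaries into one coherent summary.
    return summarize(" ".join(partials))
```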

A Corpus of Hindi-English Code-Mixed Posts to Hate Speech Detection
Prashant Kapil | Asif Ekbal

Social media content, such as blog posts, comments, and tweets, often contains offensive language, including racial hate speech, personal attacks, and sexual harassment. Detecting inappropriate language is crucial for user safety and the prevention of hateful behavior and aggression. This study introduces HECM, a corpus of Hindi-English code-mixed tweets, to fill the gap in Hindi language resources. The corpus comprises approximately 9.4K tweets labeled as hateful or non-hateful. It includes detailed information on the data, such as the annotation schema, the label definitions, and an inter-annotator agreement score of 85%. The study evaluates the effectiveness of traditional machine learning, deep neural network, and transformer encoder-based approaches. The results show a significant improvement in terms of macro-F1 and weighted-F1 scores. Additionally, a lexicon containing 2,000 entries tagged in 21 categories is created based on the multilingual HURTLEX lexicon. This lexicon is merged with the transformer encoder, resulting in a marginal improvement in macro-F1 and weighted-F1. The study also experiments with a Hindi-Devanagari dataset to assess the impact of the lexicon on performance metrics.
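A hedged sketch of the lexicon-transformer fusion: HURTLEX category counts are concatenated to the pooled encoder representation before the classification layer. The 21 categories follow the text; everything else (checkpoint, head) is an illustrative assumption.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class LexiconFusedClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased", n_lex=21, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Classification head sees both the text encoding and lexicon counts.
        self.head = nn.Linear(hidden + n_lex, n_classes)

    def forward(self, input_ids, attention_mask, lex_feats):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS] representation
        return self.head(torch.cat([pooled, lex_feats], dim=-1))
```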

A Survey on Combating Hate Speech through Detection and Prevention in English
Prashant Kapil | Asif Ekbal

The rapid rise of social networks has brought with it an increase in hate speech, which poses a significant challenge to society, researchers, companies, and policymakers. Hate speech can take the form of text or multimodal content, such as memes, GIFs, audio, or videos, and the scientific study of hate speech from a computer science perspective has gained attention in recent years. The detection and combating of hate speech is mostly considered a supervised task, with annotated corpora and shared resources playing a crucial role. Social networks are using modern AI tools to combat hate speech, and this survey comprehensively examines the work done to combat hate in the English language. It delves into state-of-the-art methodologies for unimodal and multimodal hate identification, the role of explainable AI, prevention of hate speech through style transfer, and counternarrative generation, while also discussing the efficacy and limitations of these methods. Compared with earlier surveys, this paper offers a well-organized presentation of methods to combat hate.

Extractive Summarization using Extended TextRank Algorithm
Ansh N. Vora | Rinit Mayur Jain | Aastha Sanjeev Shah | Sheetal Sonawane

With so much information available online, it’s more important than ever to have reliable tools for summarizing text quickly and accurately. In this paper, we introduce a new way to improve the popular TextRank algorithm for extractive summarization. By adding a dynamic damping factor and using Latent Dirichlet Allocation (LDA) to enhance how text is represented, our method creates more meaningful summaries. We tested it with metrics like Pyramid, METEOR, and ROUGE, and compared it to the original TextRank. The results were promising, showing that our approach produces better summaries and could be useful for real-world applications like text mining and information retrieval.
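For orientation, a simplified TextRank baseline looks like the sketch below: sentences as nodes, TF-IDF cosine similarities as edge weights, and PageRank with damping factor d scoring sentences. The paper's dynamic damping and LDA enrichment are not reproduced here.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, k=3, d=0.85):
    # Build a weighted sentence-similarity graph.
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph, alpha=d)  # alpha is the damping factor
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order
```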

Emojis Trash or Treasure: Utilizing Emoji to Aid Hate Speech Detection
Tanik Saikh | Soham Barman | Harsh Kumar | Saswat Sahu | Souvick Palit

In this study, we delve into the fascinating realm of emojis and their impact on identifying hate speech in both Bengali and English languages. Through extensive exploration of various techniques, particularly the integration of Multilingual BERT (MBert) and Emoji2Vec embeddings, we strive to shed light on the immense potential of emojis in this detection process. By meticulously comparing these advanced models with conventional approaches, we uncover the intricate contextual cues that emojis bring to the table. Ultimately, our discoveries underscore the invaluable role of emojis in hate speech detection, thereby providing valuable insights for the creation of resilient and context-aware systems to combat online toxicity. Our findings showcase the potential of emojis as valuable assets rather than mere embellishments in the realm of hate speech detection. By leveraging the combined strength of MBert and Emoji2Vec, our models exhibit enhanced capabilities in deciphering the emotional subtleties often intertwined with hate speech expressions.

Utilizing POS-Driven Pitch Contour Analysis for Enhanced Tamil Text-to-Speech Synthesis
Preethi Thinakaran | Anushiya Rachel Gladston | P Vijayalakshmi | T Nagarajan | Malarvizhi Muthuramalingam | Sooriya S

We propose a novel approach to text-to-speech synthesis that integrates pitch contour labels derived from a highest-occurrence analysis for each Part-of-Speech (POS) tag. Using the Stanford POS Tagger, grammatical tags are assigned to words, and the most frequently occurring pitch contour labels associated with these tags are analyzed, focusing on both unigram and bigram statistics. The primary goal is to identify the pitch contour for each POS tag based on its frequency of occurrence. These pitch contour labels are incorporated into the synthesized waveform using the TD-PSOLA (Time Domain Pitch Synchronous Overlap and Add) signal processing algorithm. The resulting waveform is evaluated using Mean Opinion Scores (MOS), demonstrating significant enhancements in quality and producing prosodically rich synthetic speech.

pdf
Chirp Group Delay based Feature for Speech Applications
Malarvizhi Muthuramalingam | Anushiya Rachel Gladston | P Vijayalakshmi | T Nagarajan

The conventional Fast Fourier Transform (FFT), computed on the unit circle, gives an accurate representation of the spectrum if the signal under consideration consists of sustained oscillations. However, practical signals are not sustained oscillations. For signals that are decaying or growing along time, the phase spectrum computed using the conventional FFT is not accurate, and in turn, neither is the magnitude spectrum. Hence a feature based on a variant of the group delay spectrum, namely the chirp group delay (CGD) spectrum, is proposed. The efficacy of the proposed feature is evaluated in Gaussian Mixture Model (GMM) and Convolutional Neural Network (CNN)-based speaker identification systems. Analysis reveals a significant increase in performance when using the CGD-based feature over the magnitude spectrum.
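For illustration, a numpy sketch of one standard way to compute a chirp group delay spectrum: weight the signal by rho**(-n) so the DFT is evaluated on a circle of radius rho away from the unit circle, then apply the usual group-delay identity. The radius value below is an assumption, not the paper's setting.

```python
# A minimal chirp group delay (CGD) sketch. The DFT is evaluated on a
# circle of radius rho (rho != 1) by weighting the signal with rho**(-n);
# group delay then follows from the standard identity
#   tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2   with   y[n] = n * x[n].
import numpy as np

def chirp_group_delay(x, rho=1.02, n_fft=512, eps=1e-12):
    n = np.arange(len(x))
    xw = x * rho ** (-n)           # move the analysis circle off |z| = 1
    yw = n * xw                    # differentiation-in-frequency pair
    X = np.fft.rfft(xw, n_fft)
    Y = np.fft.rfft(yw, n_fft)
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```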

pdf
From Data to Insights: The Power of LM’s in Match Summarization
Satyavrat Gaur | Pasi Shailendra | Rajdeep Kumar | Rudra Chandra Ghosh | Nitin Sharma

The application of Natural Language Processing is progressively extending into many domains. We are motivated to evaluate language models' (LMs) capabilities in many real-world domains due to their significant potential. This study examines the use of LMs in sports, with particular emphasis on their ability to convert data into text and their understanding of cricket. By examining scorecards from cricket, a sport widely played on the Indian subcontinent and in many other regions, we evaluate the summaries produced by LMs from several viewpoints. We collected concise summaries of scorecards from the ODI World Cup 2023 and assessed them in terms of both factual accuracy and sports-specific significance. We analyze which factors are included in the summaries and which are excluded, and we examine prevalent mistakes concerning completeness, correctness, and conciseness. We present our findings here; our dataset and code are available at https://github.com/satyawork/ODI-WORLDCUP.git
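As a toy illustration of the data-to-text setup (the authors' actual prompt and scorecard schema are not given in the abstract), a scorecard-like record can be flattened into an LM prompt:

```python
# A toy scorecard record flattened into a summarization prompt. Field
# names are illustrative, not the paper's schema.
scorecard = {
    "teams": ("India", "Australia"),
    "scores": {"India": "240/10 (50.0 ov)", "Australia": "241/4 (43.0 ov)"},
    "top_scorer": ("Travis Head", 137),
    "result": "Australia won by 6 wickets",
}

def build_prompt(card):
    lines = [f"{t}: {card['scores'][t]}" for t in card["teams"]]
    lines.append(f"Top scorer: {card['top_scorer'][0]} ({card['top_scorer'][1]})")
    lines.append(f"Result: {card['result']}")
    return ("Write a short, factually accurate match summary from this "
            "scorecard:\n" + "\n".join(lines))

print(build_prompt(scorecard))  # feed this prompt to the LM under evaluation
```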

pdf
Automating Humor: A Novel Approach to Joke Generation Using Template Extraction and Infilling
Mayank Goel | Parameswari Krishnamurthy | Radhika Mamidi

This paper presents a novel approach to humor generation in natural language processing by automating the creation of jokes through template extraction and infilling. Traditional methods have relied on predefined templates or neural network models, which either lack complexity or fail to produce genuinely humorous content. Our method extracts templates from existing jokes based on semantic salience and BERT's attention weights. We then infill these templates using BERT and large language models (LLMs) such as GPT-4 to generate new jokes. Our results indicate that the generated jokes are novel and human-like, with BERT showing promise in generating funny content and GPT-4 excelling in creating clever jokes. The study contributes to a deeper understanding of humor generation and the potential of AI in creative domains.
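For a flavour of the infilling step, the sketch below masks a salient content word and lets BERT's fill-mask head propose replacements; in the paper the masked positions come from attention-based salience, whereas here the mask is hand-picked.

```python
# A minimal sketch of template infilling with BERT's masked-LM head.
# Choosing which word to mask via attention weights is the paper's
# technique; masking "chicken" here is a hand-picked stand-in.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
template = "why did the [MASK] cross the road?"
for cand in fill(template, top_k=5):
    print(cand["token_str"], round(cand["score"], 3))  # candidate infills
```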

pdf
Sentiment and sarcasm: Analyzing gender bias in sports through social media with deep learning
Sethulakshmi Praveen | Balaji Tk | Sreeja Sr | Annushree Bablani

Gender bias continues to be a pervasive issue, especially in public discourse surrounding high-profile events like the Olympics. Social media platforms, particularly Twitter, have become a central space for discussing such biases, making it crucial to analyze these conversations to better understand public attitudes. Sentiment analysis plays a key role in this effort by determining how people feel about gender bias. However, sarcasm often complicates sentiment analysis by distorting the true sentiment of a tweet, as sarcastic expressions can mask negative or positive sentiments. To address this, the study introduces a novel framework called SENSA (SENtiment and Sarcasm Analysis), designed to detect both sentiment and sarcasm in tweets related to gender bias. The framework leverages the R2B-CNN model for robust sarcasm and sentiment classification. Using approximately 5,000 tweets related to gender bias from 2010 to August 30, 2024, SENSA applies advanced sarcasm detection to account for shifts in sentiment caused by sarcastic remarks. The R2B-CNN model achieves 92.32% accuracy, 92.75% precision, and a 92.53% F1-score for sarcasm detection, and 93.67% accuracy, 92.33% precision, and a 92.33% F1-score for sentiment classification. SENSA provides a comprehensive understanding of gender bias discussions on social media by capturing both sentiment and sarcasm to reveal deeper insights into public perceptions.

pdf
An Aid to Assamese Language Processing by Constructing an Offline Assamese Handwritten Dataset
Debabrata Khargharia | Samir Kumar Borgohain

Recent years have seen a growing interest in analyzing Indian handwritten documents. In pattern recognition, particularly handwritten document recognition, the availability of standard databases is essential for assessing algorithm efficacy and facilitating result comparisons among research groups. However, there is a notable scarcity of standardized databases for handwritten texts in Indian languages. This paper presents a comprehensive methodology for the development of a novel, unconstrained dataset named OAHTD (Offline Assamese Handwritten Text Dataset) for the Assamese language, derived from offline handwritten documents. The dataset, which represents a significant contribution to the field of Optical Character Recognition (OCR) for handwritten Assamese, is the first of its kind in this domain. The corpus comprises 410 document images, each containing a diverse array of linguistic elements including words, numerals, individual characters, and various symbols. These documents were collected from a demographically diverse cohort of 300 contributors, spanning an age range of 10 to 76 years and representing varied educational backgrounds and genders. This meticulously curated collection aims to provide a robust foundation for the development and evaluation of OCR algorithms specifically tailored to the Assamese script, addressing a critical gap in the existing literature and resources for this language.

pdf
Enhancing Masked Word Prediction in Tamil Language Models: A Synergistic Approach Using BERT and SBERT
Viswadarshan R R | Viswaa Selvam S | Felicia Lilian J | Mahalakshmi S

This research work presents a novel approach to enhancing masked word prediction and sentence-level semantic analysis in Tamil language models. By synergistically combining BERT and Sentence-BERT (SBERT) models, we leverage the strengths of both architectures to capture the contextual understanding and semantic relationships in Tamil sentences. Our methodology incorporates sentence tokenization as a crucial pre-processing step, preserving the grammatical structure and word-level dependencies of Tamil sentences. We trained BERT and SBERT on a diverse corpus of Tamil data, including synthetic datasets, the Oscar Corpus, AI4Bharat Parallel Corpus, and data extracted from Tamil Wikipedia and news websites. The combined model effectively predicts masked words while maintaining semantic coherence in generated sentences. While traditional accuracy metrics may not fully capture the model's performance, intrinsic and extrinsic evaluations reveal the model's ability to generate contextually relevant and linguistically sound outputs. Our research highlights the importance of sentence tokenization and the synergistic combination of BERT and SBERT for improving masked word prediction in Tamil sentences.
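One plausible reading of the BERT-SBERT synergy, sketched with generic multilingual checkpoints (the authors train their own Tamil models): BERT proposes masked-word candidates and SBERT reranks the filled sentences for semantic coherence; the equal blending weights are an assumption.

```python
# A minimal sketch: BERT's fill-mask head proposes candidates for a
# masked word, and SBERT reranks the filled sentences by semantic
# similarity to the reference context. Checkpoints are generic stand-ins.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def predict_masked(masked_sentence, context, top_k=10):
    candidates = fill(masked_sentence, top_k=top_k)
    ref = sbert.encode(context, convert_to_tensor=True)
    scored = []
    for cand in candidates:
        emb = sbert.encode(cand["sequence"], convert_to_tensor=True)
        coherence = util.cos_sim(ref, emb).item()
        # Blend MLM probability with semantic coherence (equal weights assumed).
        scored.append((0.5 * cand["score"] + 0.5 * coherence, cand["token_str"]))
    return max(scored)[1]  # best word under the combined score
```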

pdf
Pronominal Anaphora Resolution in Konkani language incorporating Gender Agreement
Poonam A. Navelker | Jyoti Pawar

Konkani is a low-resource language spoken mainly on the central west coast of India; approximately 2.3 million people speak it (Office of the Registrar General and Census Commissioner, India, 2011). It is also the official language of the state of Goa and belongs to the Southern Indo-Aryan language group. The official script for writing Konkani is Devanagari. Nevertheless, being a low-resource language has hampered its development on digital platforms, and Konkani has yet to establish a significant digital presence. To improve this situation, contributions to Natural Language Understanding in Konkani are important. This paper aims to resolve pronominal anaphora in Konkani using a rule-based method incorporating gender agreement, a capability required in NLP applications like text summarization, machine translation, and question-answering systems. While research has been carried out on English and other foreign languages, as well as on Indian languages like Tamil, Kannada, Malayalam, Bengali, and Marathi, no work has addressed Konkani thus far. This is the very first attempt to resolve anaphora in Konkani.
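The core rule, recency combined with gender and number agreement, can be sketched as below; the pre-annotated toy records stand in for the paper's Konkani grammatical analysis.

```python
# A minimal sketch of rule-based pronominal anaphora resolution with
# gender agreement: the most recent preceding noun phrase whose gender
# and number match the pronoun is chosen as the antecedent. Inputs here
# are toy, pre-annotated records, not the paper's Konkani pipeline.
def resolve(pronoun, candidates):
    """candidates: NPs preceding the pronoun, each a dict with 'text',
    'gender', and 'number'; the nearest candidate comes last."""
    for np_ in reversed(candidates):  # prefer the most recent match
        if (np_["gender"] == pronoun["gender"]
                and np_["number"] == pronoun["number"]):
            return np_["text"]
    return None  # no agreeing antecedent found

nps = [{"text": "Maria", "gender": "f", "number": "sg"},
       {"text": "John", "gender": "m", "number": "sg"}]
print(resolve({"gender": "f", "number": "sg"}, nps))  # -> Maria
```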

pdf
Reconsidering SMT Over NMT for Closely Related Languages: A Case Study of Persian-Hindi Pair
Waisullah Yousofi | Pushpak Bhattacharyya

This paper demonstrates that Phrase-Based Statistical Machine Translation (PBSMT) can outperform Transformer-based Neural Machine Translation (NMT) in moderate-resource scenarios, specifically for structurally similar languages, the Persian-Hindi pair in our case. Despite the Transformer architecture's typical preference for large parallel corpora, our results show that PBSMT achieves a BLEU score of 66.32, significantly exceeding the Transformer-NMT score of 53.7 when trained on the same dataset.

pdf
DesiPayanam: developing an Indic travel partner
Diviya K N | Mrinalini K | Vijayalakshmi P | Thenmozhi J | Nagarajan T

Domain-specific machine translation (MT) systems are essential in bridging the communication gap between people across different businesses, economies, and countries. India, a linguistically rich country with a booming tourism industry, is a perfect market for such an MT system. On this note, the current work aims to develop a domain-specific transformer-based MT system for Hindi-to-Tamil translation. A neural MT (NMT) model is trained from scratch, and the hyper-parameters of the model architecture are modified to analyze their effect on translation performance. Further, a pretrained transformer MT model is finetuned to better suit the tourism domain. The proposed experiments are observed to improve the BLEU scores of the translation system by a maximum of 1% and 4% for the trained-from-scratch and finetuned systems, respectively.

pdf
RoMantra: Optimizing Neural Machine Translation for Low-Resource Languages through Romanization
Govind Soni | Pushpak Bhattacharyya

Neural Machine Translation (NMT) for low-resource language pairs with distinct scripts, such as Hindi-Chinese and Japanese-Hindi, poses significant challenges due to scriptural and linguistic differences. This paper investigates the efficacy of romanization as a preprocessing step to bridge these gaps. We compare baseline models trained on native scripts with models incorporating romanization in three configurations: both-side, source-side only, and target-side only. Additionally, we introduce a script restoration model that converts romanized output back to native scripts, ensuring accurate evaluation. Our experiments show that romanization, particularly when applied to both sides, improves translation quality across the studied language pairs. The script restoration model further enhances the practicality of this approach by enabling evaluation in native scripts with some performance loss. This work provides insights into leveraging romanization for NMT in low-resource, cross-script settings, presenting a promising direction for under-researched language combinations.
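As a rough illustration of the preprocessing step, the indic-transliteration package (one possible tool; the paper's romanization scheme is not stated) can map Devanagari text to a Roman scheme before NMT training, with the separate restoration model learning the inverse mapping.

```python
# A minimal sketch of romanization as NMT preprocessing. The IAST scheme
# is one possible choice; the paper does not specify its scheme.
from indic_transliteration import sanscript

hindi = "मैं स्कूल जाता हूँ"
roman = sanscript.transliterate(hindi, sanscript.DEVANAGARI, sanscript.IAST)
print(roman)  # romanized form fed to the NMT system in place of Devanagari
```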

pdf
Domain Dynamics: Evaluating Large Language Models in English-Hindi Translation
Soham Bhattacharjee | Baban Gain | Asif Ekbal

Large Language Models (LLMs) have demonstrated impressive capabilities in machine translation, leveraging extensive pre-training on vast amounts of data. However, this generalist training often overlooks domain-specific nuances, leading to potential difficulties when translating specialized texts. In this study, we present a multi-domain test suite, collated from previously published datasets, designed to challenge and evaluate the translation abilities of LLMs. The test suite encompasses diverse domains such as judicial, education, literature (specifically religious texts), and noisy user-generated content from online product reviews and forums like Reddit. Each domain consists of approximately 250-300 sentences, carefully curated and randomized in the final compilation. This English-to-Hindi dataset aims to evaluate and expose the limitations of LLM-based translation systems, offering valuable insights into areas requiring further research and development. We have submitted the dataset to WMT24 Break the LLM.

pdf
Pronunciation scoring for dysarthric speakers with DNN-HMM based goodness of pronunciation (GoP) measure
Shruti Jeyaraman | Anantha K. Krishnan | Vijayalakshmi P | Nagarajan T

Dysarthria is a neurological motor disorder caused by cranial damage that interferes with the muscles involved in the correct pronunciation of sounds and intelligible speech. Computer Aided Pronunciation Training (CAPT) systems, traditionally used for the pronunciation assessment of L2 language learners, can offer a way to detect and score mispronounced sounds in dysarthric speakers without human intervention. In this work, phonetic-level DNN-HMM based Goodness of Pronunciation (GoP) scoring on a corpus of native Tamil dysarthric speakers is presented. The scores are calculated using the posteriors of subphonemic elements called senones, with a focus on their prevalence across phones and their transitions across HMM states. The phonetic-level scores obtained for speakers of different severity levels help establish speaker-specific trends in pronunciation through an objective log-likelihood metric, in contrast to subjective evaluations by Speech Language Therapists (SLTs).
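A simplified sketch of a frame-averaged GoP score from DNN posteriors follows; real systems such as this one score senones under HMM alignments, so the flat phone-posterior version below is only illustrative.

```python
# A simplified goodness-of-pronunciation (GoP) sketch: average the log
# posterior of the canonical phone over the frames aligned to it. The
# paper scores senones with HMM alignments; this is a flat approximation.
import numpy as np

def gop(posteriors, phone_idx, eps=1e-10):
    """posteriors: (T, n_phones) frame-wise DNN posteriors for the
    segment aligned to the canonical phone; phone_idx: its index."""
    return float(np.mean(np.log(posteriors[:, phone_idx] + eps)))

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=25)   # toy 25-frame segment
print(gop(post, phone_idx=7))  # closer to 0 means better pronounced
```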

pdf
Severity Classification and Dysarthric Speech Detection using Self-Supervised Representations
Sanjay B | Priyadharshini M.k | Vijayalakshmi P | Nagarajan T

Automatic detection and classification of dysarthria severity from speech provides a non-invasive and efficient diagnostic tool, offering clinicians valuable insights to guide treatment and therapy decisions. Our study evaluated two pre-trained models, wav2vec2-BASE and distilALHuBERT, for feature extraction to build speech detection and severity-level classification systems for dysarthric speech. We conducted experiments on the TDSC dataset using two approaches: a machine learning model (support vector machine, SVM) and a deep learning model (convolutional neural network, CNN). Our findings showed that features derived from distilALHuBERT significantly outperformed those from wav2vec2-BASE in both dysarthric speech detection and severity classification tasks. Notably, the distilALHuBERT features achieved 99% accuracy in automatic detection and 95% accuracy in severity classification, surpassing the performance of wav2vec2 features.
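A minimal sketch of the detection pipeline with the wav2vec2-BASE branch (the distilled model the paper favours would slot in the same way): mean-pooled self-supervised features feed a classical SVM.

```python
# A minimal sketch: mean-pooled self-supervised speech features feed an
# SVM for dysarthric-speech detection. Mean pooling is an assumption;
# the paper does not state its pooling strategy.
import numpy as np
import torch
from sklearn.svm import SVC
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def embed(waveform, sr=16000):
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()    # utterance-level vector

# X: list of 16 kHz waveforms, y: 0 = control, 1 = dysarthric (toy labels)
# clf = SVC(kernel="rbf").fit(np.stack([embed(w) for w in X]), y)
```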

pdf
Aspect-based Summaries from Online Product Reviews: A Comparative Study using various LLMs
Pratik Deelip Korkankar | Alvyn Abranches | Pradnya Bhagat | Jyoti Pawar

In the era of online shopping, the volume of product reviews on e-commerce platforms is increasing massively on a daily basis. Any given product can attract a flood of reviews, and manually analysing each of them to understand the important aspects or opinions associated with the product is a difficult and time-consuming task; it becomes nearly impossible for the customer to decide whether or not to buy the product. Thus, it becomes necessary to have an aspect-based summary generated from these user reviews, which can act as a guide for the interested buyer in decision-making. Recently, Large Language Models (LLMs) have shown great potential for solving diverse Natural Language Processing (NLP) tasks, including summarization. Our paper explores the use of various LLMs such as Llama3, GPT-4o, Gemma2, Mistral, Mixtral and Qwen2 on the publicly available domain-specific Amazon reviews dataset. Our study proposes an algorithm to accurately identify product aspects and examines each model's ability to extract relevant information and generate concise summaries. Further, we analyzed the experimental results of each of these LLMs with summary evaluation metrics such as ROUGE, METEOR, BERTScore F1 and GPT-4o to evaluate the quality of the generated aspect-based summaries. Our study highlights the strengths and limitations of each of these LLMs, giving valuable insights for guiding researchers in harnessing LLMs to generate aspect-based summaries of products on online shopping platforms.
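As a hedged sketch of the two-step recipe (the paper's actual aspect-identification algorithm is not given here), frequent noun lemmas can serve as candidate aspects and an LLM prompt can be built around them:

```python
# A minimal sketch: mine candidate aspects as frequent noun lemmas, then
# build an aspect-conditioned summarization prompt for an LLM. The spaCy
# heuristic and the prompt wording are illustrative assumptions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def top_aspects(reviews, k=5):
    nouns = Counter()
    for doc in nlp.pipe(reviews):
        nouns.update(t.lemma_.lower() for t in doc if t.pos_ == "NOUN")
    return [w for w, _ in nouns.most_common(k)]

def build_prompt(reviews, aspects):
    joined = "\n".join(reviews)
    return (f"Summarize opinions on these aspects: {', '.join(aspects)}.\n"
            f"Reviews:\n{joined}\nAspect-based summary:")

reviews = ["Battery life is great but the screen scratches easily.",
           "Screen is dim; battery lasts two days though."]
print(build_prompt(reviews, top_aspects(reviews)))
```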

pdf
Story-Yarn : An Interactive Story Building Application
Hryadyansh Saraswat | Snehal D. Shete | Vikas Dangi | Kushagra Agrawal | Anuj Aggarwal | Aditya Nigam

Story building is an important part of a child's language and overall development. Developing an interactive, artificial intelligence (AI) based solution to create stories for children is an open and challenging problem. Methods combining large language models (LLMs) and knowledge graphs (KGs) have further enabled high-quality and coherent story generation. In this work, we present Story Yarn, a platform developed for interactive story creation for children. We customise a KG, using children's stories, which captures relationships between the components of stories. This customised KG is then used along with an LLM to collaboratively create a story. We have also built a simple app to facilitate user interaction. This platform can aid the creative development of children and can be used at home or in schools.

pdf
CM_CLIP: Unveiling Code-Mixed Multimodal Learning with Cross-Lingual CLIP Adaptations
Gitanjali Kumari | Arindam Chatterjee | Ashutosh Bajpai | Asif Ekbal | Vinutha B. NarayanaMurthy

In this paper, we present CMCLIP, a Code-Mixed Contrastive Linked Image Pre-trained model, an innovative extension of the widely recognized CLIP model. Our work adapts the CLIP framework to the code-mixed environment through a novel cross-lingual teacher training methodology. Building on the strengths of CLIP, we introduce the first code-mixed pre-trained text-and-vision model, CMCLIP, specifically designed for Hindi-English code-mixed multimodal language settings. The model is developed in two variants: CMCLIP-RB, based on ResNet, and CMCLIP-VX, based on ViT, both of which adapt the original CLIP model to suit code-mixed data. We also introduce a large, novel dataset called Parallel Hybrid Multimodal Code-mixed Hinglish (PHMCH), which forms the foundation for teacher training. The CMCLIP models are evaluated on various downstream tasks, including code-mixed Image-Text Retrieval (ITR) and classification tasks, such as humor and sarcasm detection, using a code-mixed meme dataset. Our experimental results demonstrate that CMCLIP outperforms existing models, such as M3P and multilingual-CLIP, establishing state-of-the-art performance for code-mixed multimodal tasks. Although our data and frameworks focus on Hindi-English code-mixing, they can be extended to other code-mixed language settings.

pdf
Improving Few-shot Prompting using Cluster-based Sample Retrieval for Medical NER in Clinical Text
Meethu Mohan C | Sneha Shaji Punnan | Jeena Kleenankandy

Named Entity Recognition (NER) in the medical domain is crucial for extracting essential information from clinical text. Large Language Models (LLMs) have demonstrated remarkable capabilities in this task, but their performance is highly dependent on the quality of the prompts. Few-shot prompting, or prompt-by-example, where the input query to the LLM is augmented with one or more sample outputs, is a well-known technique for guiding LLMs to the expected result, and the quality of the samples in the prompt plays an important role. This paper proposes to improve the performance of few-shot prompting for medical NER on clinical text using a cluster-based strategy for sample selection. We employ concepts from Retrieval Augmented Generation (RAG) and K-means clustering to identify the most similar annotated examples for any given input text. Using these contextually relevant yet divergent training samples as examples, we guide the LLM toward extracting more accurate medical entities. Our experiments using the llama-2 model show that this approach significantly outperforms zero-shot prompting and randomly sampled few-shot prompting on the two datasets chosen for this study, demonstrating the efficacy of cluster-based retrieval in improving few-shot prompting for medical NER tasks.
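A minimal sketch of the sample-selection strategy: training sentences are embedded and clustered with K-means, and for each query the nearest annotated example from every cluster is retrieved, yielding demonstrations that are similar to the query yet mutually diverse. The embedding model below is a generic stand-in.

```python
# A minimal sketch of cluster-based few-shot example selection: cluster
# the training pool, then pick the example nearest the query from each
# cluster (similar yet diverse demonstrations for the prompt).
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic stand-in

def select_examples(train_sents, query, k_clusters=4):
    X = encoder.encode(train_sents)
    labels = KMeans(n_clusters=k_clusters, random_state=0).fit_predict(X)
    q = encoder.encode([query])[0]
    chosen = []
    for c in range(k_clusters):
        idx = np.where(labels == c)[0]
        sims = X[idx] @ q / (np.linalg.norm(X[idx], axis=1) * np.linalg.norm(q))
        chosen.append(train_sents[idx[np.argmax(sims)]])
    return chosen  # prepend these (with their gold annotations) to the prompt
```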

pdf
MalUpama - Figurative Language Identification in Malayalam -An Experimental Study
Reenu Paul | Wincy Abraham | Anitha S. Pillai

Figurative language, particularly in underrepresented languages within the Dravidian family, serves as a critical medium for conveying emotions and cultural meaning. Despite the rich literary traditions of languages such as Malayalam, Tamil, Telugu, and Kannada, there has been minimal progress in developing computational techniques to analyze figurative expressions. Historically, Malayalam was known by various names, such as Malayanma and Malabari; similarly, Kerala was known as Malanadu before adopting its current name, which metaphorically refers to the land between the Indian Ocean and the Western Ghats. In this study, we introduce the UPAMA model (MalUpama), designed to identify similes in Malayalam, an under-resourced Dravidian language spoken mainly in the southern Indian state of Kerala. The current research focuses on detecting the presence of similes in Malayalam prose using the Upama model, which achieves a detection accuracy of 94.5%. To the best of our knowledge, this is the first work in the Malayalam language to explore computational techniques, with a particular focus on applying machine learning to analyze figurative expressions, and it can be adopted for other Dravidian languages too. The dataset developed for this study is made publicly available, allowing scholars to contribute and further explore the 'Upama' category of Malayalam figurative language ('Alankarangal').

pdf
Integration of Self-Attention Model with Intralingual Word Embedding for Contextual Semantic Analysis of Thirukkural Text
Shanthi Murugan | Kaviyarasu S | Balasundaram S R

Thirukkural, one of the ancient works of Tamil literature, is popular worldwide due to the moral values and practices it teaches society. Understanding the verses with their meaning, especially in context, is important. In this regard, this paper introduces a system designed to generate contextualized word meanings for the couplets of the Thirukkural, tailored to help school children understand the text more effectively. Unlike traditional methods that provide detailed explanations in paragraph form, our method focuses on word-by-word interpretation based on context through an integrated self-attention model. By combining the self-attention mechanism with FastText embeddings, our approach achieves improved performance over state-of-the-art models such as Word2Vec and standalone FastText. We evaluate the semantic understanding of the Thirukkural text using manual scoring, with the Tamil Thirukkural Agarathi serving as the gold-standard dataset, demonstrating the effectiveness of our approach in capturing the nuanced semantics of the Thirukkural.
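A minimal sketch of the integration described: FastText word vectors for a couplet pass through a single self-attention layer so each word's representation is re-weighted by its context. The Tamil vector file and the single-head configuration are assumptions, and training of the attention layer is omitted.

```python
# A minimal sketch: FastText vectors re-weighted by a self-attention
# layer (Q = K = V) to yield context-aware word representations. The
# vector file path and one-head setup are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

ft = KeyedVectors.load_word2vec_format("cc.ta.300.vec")  # Tamil fastText vectors
attn = nn.MultiheadAttention(embed_dim=300, num_heads=1, batch_first=True)

def contextualize(words):
    kept = [w for w in words if w in ft]
    vecs = torch.tensor(np.stack([ft[w] for w in kept])).unsqueeze(0)
    out, weights = attn(vecs, vecs, vecs)   # self-attention over the couplet
    return out.squeeze(0), weights.squeeze(0)  # vectors + attention map
```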

pdf
Standardizing Genomic Reports: A Dataset, A Standardized Format, and A Prompt-Based Technique for Structured Data Extraction
Tamali Banerjee | Akshit Varmora | Jay J. Gorakhiya | Sanand Sasidharan | Anuradha Kanamarlapudi | Pushpak Bhattacharyya

Extracting information from genomic reports of cancer patients is crucial for both healthcare professionals and cancer research. While Large Language Models (LLMs) have shown promise in extracting information, their potential for handling genomic reports remains unexplored. These reports are complex, multi-page documents that feature a variety of visually rich, structured layouts and contain many domain-specific terms. Two primary challenges complicate the process: (i) extracting data from PDFs with intricate layouts and domain-specific terminology and (ii) dealing with variations in report layouts from different laboratories, making extraction layout-dependent and posing challenges for subsequent data processing. To tackle these issues, we propose GR-PROMPT, a prompt-based technique, and GR-FORMAT, a standardized format. Together, these two convert a genomic report in PDF format into GR-FORMAT as a JSON file using a multimodal LLM. To address the lack of available datasets for this task, we introduce GR-DATASET, a synthetic collection of 100 cancer genomic reports in PDF format. Each report is accompanied by key-value information presented in a layout-specific format, as well as structured key-value information in GR-FORMAT. This is the first dataset in this domain, intended to promote further research on the task, and we perform our experiments on it.

pdf
RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages
Harshvivek Ankush Kashid | Pushpak Bhattacharyya

Optical Character Recognition (OCR) technology has revolutionized the digitization of printed text, enabling efficient data extraction and analysis across various domains. Just like machine translation systems, OCR systems are prone to errors. In this work, we address the challenge of data generation and post-OCR error correction, specifically for low-resource languages. We propose RoundTripOCR, an approach for synthetic data generation for Devanagari languages that tackles the scarcity of post-OCR error correction datasets for low-resource languages. We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit. We also present a novel approach for OCR error correction by leveraging techniques from machine translation: we translate erroneous OCR output into a corrected form by treating OCR errors as mistranslations in a parallel text corpus, employing pre-trained transformer models to learn the mapping from erroneous to correct text pairs.
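A hedged sketch of the round-trip idea: clean text is rendered to an image, an OCR engine reads it back, and the (noisy, clean) pair becomes parallel data for a correction model trained like an MT system. Tesseract and the font path below are stand-ins; the paper's OCR setup is not specified in the abstract.

```python
# A minimal round-trip pair generator: render clean text to an image,
# OCR it back, and keep (ocr_output, clean_text) as a parallel training
# pair. Tesseract with Hindi data and the font path are assumptions.
import pytesseract
from PIL import Image, ImageDraw, ImageFont

def round_trip_pair(clean_text, font_path="NotoSansDevanagari-Regular.ttf"):
    font = ImageFont.truetype(font_path, 32)
    img = Image.new("RGB", (1200, 64), "white")
    ImageDraw.Draw(img).text((10, 10), clean_text, fill="black", font=font)
    noisy = pytesseract.image_to_string(img, lang="hin").strip()
    return noisy, clean_text  # source/target pair for the correction model
```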

pdf
Survey on Computational Approaches to Implicature
Kaveri Anuranjana | Srihitha Mallepally | Sriharshitha Mareddy | Amit Shukla | Radhika Mamidi

This paper explores the concept of solving implicature in Natural Language Processing (NLP), highlighting its significance in understanding indirect communication. Drawing on foundational theories by Austin, Searle, and Grice, we discuss how implicature extends beyond literal language to convey nuanced meanings. We review existing datasets, including the Pragmatic Understanding Benchmark (PUB), that assess models’ capabilities in recognizing and interpreting implicatures. Despite recent advances in large language models (LLMs), challenges remain in effectively processing implicature due to limitations in training data and the complexities of contextual interpretation. We propose future directions for research, including the enhancement of datasets and the integration of pragmatic reasoning tasks, to improve LLMs’ understanding of implicature and facilitate better human-computer interaction.

pdf
Sentiment Analysis for Konkani using Zero-Shot Marathi Trained Neural Network Model
Rohit M. Ghosarwadkar | Seamus Fred Rodrigues | Pradnya Bhagat | Alvyn Abranches | Pratik Deelip Korkankar | Jyoti Pawar

Sentiment Analysis plays a crucial role in understanding user opinions in various languages. The paper presents an experiment with a sentiment analysis model fine-tuned on Marathi sentences to classify sentiments into positive, negative, and neutral categories. The fine-tuned model shows high accuracy when tested on Konkani sentences, despite not being explicitly trained on Konkani data, since Marathi is linguistically very close to Konkani. This outcome highlights the effectiveness of zero-shot learning, where the model generalizes well across linguistically similar languages. Evaluation metrics such as accuracy, balanced accuracy, negative accuracy, neutral accuracy, positive accuracy and confusion matrix scores were used to assess performance, with Konkani sentences demonstrating superior results. These findings indicate that zero-shot sentiment analysis can be a powerful tool for sentiment classification in resource-poor languages like Konkani, where labeled data is limited, and that leveraging linguistically similar languages can help generate datasets for low-resource languages, enhancing sentiment analysis capabilities where labeled data is scarce. By utilizing related languages, zero-shot models can achieve meaningful performance without the need for extensive labeled data for the target language.

pdf
Survey of Pseudonymization, Abstractive Summarization & Spell Checker for Hindi and Marathi
Rasika Ransing | Mohammed Amaan Dhamaskar | Ayush Rajpurohit | Amey Dhoke | Sanket Dalvi

India’s vast linguistic diversity presents unique challenges and opportunities for technological advancement, especially in the realm of Natural Language Processing (NLP). While there has been significant progress in NLP applications for widely spoken languages, the regional languages of India, such as Marathi and Hindi, remain underserved. Research in the field of NLP for Indian regional languages is at a formative stage and holds immense significance. This paper aims to build a platform that enables users to access features such as text anonymization, abstractive text summarization and spell checking in English, Hindi and Marathi. These tools are aimed at serving enterprise and consumer clients who predominantly use Indian regional languages.

pdf
Synthetic Data and Model Dynamics based Performance Analysis for Assamese-Bodo Low Resource NMT
Kuwali Talukdar | Shikhar Kumar Sarma | Kishore Kashyap

This paper presents the modelling and performance analysis of Neural Machine Translation (NMT) for the low-resource Assamese-Bodo language pair, focusing on model tuning and the use of synthetic data. Given the scarcity of parallel corpora for these languages, synthetic data generation techniques such as back-translation were employed to enhance translation performance. A standard NMT architecture was used, along with the necessary preprocessing steps of the NMT pipeline. Experiments across varying model parameters were performed and scores recorded. The model's performance was evaluated using the BLEU score, which showed significant improvement when synthetic data was incorporated into the training process. While a base model trained on a relatively small gold-standard dataset yielded an overall BLEU of 11.35, the optimized, tuned model with synthetic data considerably improved BLEU scores across domains, with overall BLEU of up to 14.74. Challenges related to data scarcity and model optimization are also discussed, along with potential future improvements.
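The back-translation step mentioned above can be sketched as follows; reverse_model.translate is a placeholder for whichever reverse-direction NMT system is available, not a specific toolkit API.

```python
# A minimal back-translation sketch: a reverse (Bodo -> Assamese) model
# translates monolingual Bodo sentences, and each (synthetic Assamese,
# real Bodo) pair augments the Assamese -> Bodo training set.
def back_translate(mono_target_sents, reverse_model):
    synthetic_pairs = []
    for tgt in mono_target_sents:
        src = reverse_model.translate(tgt)   # synthetic source sentence
        synthetic_pairs.append((src, tgt))   # paired with the real target
    return synthetic_pairs

# train_data = gold_pairs + back_translate(bodo_monolingual, rev_model)
```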

pdf
End to End Multilingual Coreference Resolution for Indian Languages
Sobha Lalitha Devi | Vijay Sundar Ram | Pattabhi RK Rao

This paper describes an end-to-end model for Multilingual Coreference Resolution (CR) for low-resource languages such as Tamil, Malayalam and Hindi. We fine-tune the XLM-RoBERTa large model on a multilingual training dataset for these languages, both with and without linguistic features. XLM-R with linguistic features achieves better results than the baseline system, showing that incorporating linguistic knowledge enriches system performance. The performance of the system is comparable with state-of-the-art systems.

pdf
LangBot-Language Learning Chatbot
Madhubala Sundaram | Pattabhi RK Rao | Sobha Lalitha Devi

Chatbots are widely used in the educational domain to revolutionize how students interact and learn alongside traditional methods of learning. This paper presents our work on LangBot, a chatbot developed for learning the Tamil language. LangBot integrates the interactive features of chatbots with the study material of the Tamil courses offered by the Tamil Virtual Academy, Government of Tamil Nadu. LangBot helps students enhance their learning skills and increases their interest in learning the language. Using semi-automatic methods, we generate questions and answers related to all topics in the courses. We then develop a generative language model along with Retrieval Augmented Generation (RAG) so that the system can incorporate new syllabus changes. We have performed manual user studies, and the results obtained are encouraging. This approach offers learners an interactive tool that aligns with their syllabus, and it is observed that this enriches the overall learning experience.
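A minimal RAG-style sketch of such a pipeline (the model name and the generation hook are generic placeholders, not the authors' stack): course passages are embedded, the ones closest to a student's question are retrieved, and a generative model answers grounded in them.

```python
# A minimal RAG sketch: retrieve the course passages nearest a question,
# then condition a generative model on them. llm_generate is a
# placeholder hook for any generative LM.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def retrieve(question, passages, top_k=3):
    P = encoder.encode(passages, normalize_embeddings=True)
    q = encoder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(P @ q)[::-1][:top_k]   # cosine similarity ranking
    return [passages[i] for i in best]

def answer(question, passages, llm_generate):
    context = "\n".join(retrieve(question, passages))
    prompt = (f"Answer using only this course material:\n{context}\n\n"
              f"Q: {question}\nA:")
    return llm_generate(prompt)  # plug in any generative LM here
```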