Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Jyoti D. Pawar | Sobha Lalitha Devi
IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images
Varuna Krishna Kolla | Suryavardan Suresh | Shreyash Mishra | Sathyanarayanan Ramamoorthy | Parth Patwa | Megha Chakraborty | Aman Chadha | Amitava Das | Amit Sheth
Word embeddings, i.e., semantically meaningful vector representations of words, are largely influenced by the distributional hypothesis, "You shall know a word by the company it keeps" (Harris, 1954), whereas modern prediction-based neural network embeddings rely on design choices and hyperparameter optimization. Word embeddings like Word2Vec, GloVe, etc. capture contextuality and real-world analogies well, but contemporary convolution-based image embeddings such as VGGNet, AlexNet, etc. do not capture contextual knowledge. The popular king-queen analogy does not hold true for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained on 21K distinct image objects. JE is a way to encode multimodal data into a vector space where the text modality serves as the grounding key, to which the complementary modality (in this case, the image) is anchored. IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation. These three ways capture complementary aspects of the two modalities, which are further combined to obtain the final object-word JEs. Generated JEs are intrinsically evaluated to assess how well they capture contextuality and real-world analogies. We also evaluate pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR establishes a new standard on the aforementioned downstream tasks by outperforming the current SoTA on all the selected tasks. The code is available at https://github.com/varunakk/IMAGINATOR.
Evaluating user preferences in Hindi Text-to-Speech
Bharat Gupta
Multi-Hop Relation Aware Representations for Inductive Knowledge Graphs
Aniruddha Bala | Ankit Sharma | Shlok Sharma | Pinaki Bhaskar
Recent knowledge graph (KG) embedding methods explore parameter-efficient representations for large-scale KGs. These techniques learn entity representations using a fixed-size vocabulary. Such a vocabulary consists of all the relations and a small subset of the full entity set, referred to as anchors. An entity is hence expressed as a function of reachable anchors and immediate relations. The performance of these methods is, therefore, largely dependent on the entity tokenization strategy. Especially in inductive settings, the representation capacity of these embeddings is limited due to the absence of anchor entities, as unseen entities have no connection with training graph entities. In this work, we propose a novel entity tokenization strategy that tokenizes an entity into a set of anchors based on relation similarity and relational paths. Our model MH-RARe overcomes the challenge of unseen entities not being directly connected to the anchors by selecting informative anchors from the training graph using relation similarity. Experimental results show that our model outperforms the baselines on multiple datasets for the inductive knowledge graph completion task, attaining up to 5% improvement while maintaining parameter efficiency.
Pronunciation-Aware Syllable Tokenizer for Nepali Automatic Speech Recognition System
Rupak Raj Ghimire | Bal Krishna Bal | Balaram Prasain | Prakash Poudyal
Automatic Speech Recognition (ASR) has advanced significantly over the course of several decades, transitioning from rule-based methods to statistical approaches, and ultimately to end-to-end (E2E) frameworks. This trend continues with the progression of machine learning and deep learning methodologies. The E2E approach for ASR has been most successful for well-resourced languages with larger annotated corpora. However, the accuracy is quite low for low-resourced languages such as Nepali. In this regard, language-specific tools such as tokenizers seem to play a vital role in improving the performance of the E2E model for low-resourced languages like Nepali. In this paper, we propose a pronunciation-aware syllable tokenizer for the Nepali language which improves the results of the E2E model. Our experiments confirm that the introduction of the proposed tokenizer yields better performance, with a Character Error Rate (CER) of 8.09%, compared to other language-independent tokenizers.
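For readers unfamiliar with the metric reported above, CER is conventionally defined as the character-level edit distance between the reference and the hypothesis transcript, divided by the reference length. The following is a minimal sketch of that standard definition; the function names `edit_distance` and `cer` are illustrative and are not taken from the paper, which may normalize or tokenize text differently.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate (hypothetical helper): edits / reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of 8.09% corresponds to roughly 8 character edits per 100 reference characters.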
Neural language model embeddings for Named Entity Recognition: A study from language perspective
Muskaan Maurya | Anupam Mandal | Manoj Maurya | Naval Gupta | Somya Nayak
Named entity recognition (NER) models based on neural language models (LMs) exhibit state-of-the-art performance. However, the performance of such LMs has not been studied in detail with respect to finer language-related aspects in the context of NER tasks. Such a study will be helpful in the effective application of these models to cross-lingual and multilingual NER tasks. In this study, we examine the effects of script, vocabulary sharing, foreign names and pooling of multi-language training data for building NER models. It is observed that monolingual BERT embeddings show the highest recognition accuracy among all transformer-based LMs for monolingual NER models. It is also seen that vocabulary sharing and data augmentation with foreign named entities (NEs) are most effective in improving the accuracy of cross-lingual NER models. Multilingual NER models trained by pooling data from similar languages can address training data inadequacy and exhibit performance close to that of monolingual models trained with adequate NER-tagged data of a single language.
Understanding behaviour of large language models for short-term and long-term fairness scenarios
Talha Chafekar | Aafiya Hussain | Chon In Cheong
Large language models (LLMs) have become increasingly accessible online, so they can easily be used to generate synthetic data for technology. With the rising capabilities of LLMs, their applications span many domains. With their increasing use for automating tasks, it is crucial to understand the fairness notions harboured by these models. Our work aims to explore the consistency and behaviour of GPT-3.5 and GPT-4 in both short-term and long-term scenarios through the lens of fairness. Additionally, this study investigates the search for an optimal prompt template design for equalized opportunities. In the short-term scenario for the German Credit dataset, an intervention on a key feature recorded an increase in loan rejection rate of 37.15% for GPT-3.5 and 49.52% for GPT-4. In the long-term scenario for ML fairness gym, adding extra information about the environment to the prompts has shown no improvement over the prompt with minimal information in terms of final credit distributions. However, adding extra features to the prompt has increased the profit rate by 6.41% (from 17.2% to 23.6%) compared to a baseline maximum-reward classifier with compromising group-level recall rates.
Identifying Intent-Sentiment Co-reference from Legal Utterances
Pinaki Karkun | Dipankar Das
Co-reference is always treated as one of the challenging tasks in natural language processing and has been explored only in the domain of anaphora resolution to an extent. However, its benefit in identifying the relations between multiple entities in a single context can be explored better when we aim to identify intent and sentiment from the utterances of a dialogue or conversation. The utilization of co-reference becomes more elegant while tracking users’ intents with respect to their corresponding sentiments in a specialized domain like the judiciary. Thus, in the present attempt, we have not only identified intent and sentiment expressions individually at the token level, but also classified the utterances and identified the co-reference between intent and sentiment entities in utterance-level context. Last but not least, the deep learning algorithms have shown improvements over traditional machine learning in all cases.
An Annotated Corpus for Realis Event Detection in Short Stories Written in English and Low Resource Assamese Language
Chaitanya Kirti | Pankaj Choudhury | Ashish Anand | Prithwijit Guha
This paper presents an annotated corpus of Assamese and English short stories for event trigger detection. This marks a pioneering endeavor in short stories, contributing to developing resources for this genre, especially in the low-resource Assamese language. In the process, 200 short stories were manually annotated in both Assamese and English. The dataset was evaluated and several models were compared for predicting events that are actually happening, i.e., realis events. However, it is expensive to develop manually annotated language resources, especially when the text requires specialist knowledge to interpret. In this regard, TagIT, an automated event annotation tool, is introduced. TagIT is designed to facilitate our objective of expanding the dataset from 200 to 1,000 stories. The best-performing model was employed in TagIT to automate the event annotation process. Extensive experiments were conducted to evaluate the quality of the expanded dataset. This study further illustrates how the combination of an automatic annotation tool and human-in-the-loop participation significantly reduces the time needed to generate a high-quality dataset.
Active Learning Approach for Fine-Tuning Pre-Trained ASR Model for a Low-Resourced Language: A Case Study of Nepali
Rupak Raj Ghimire | Bal Krishna Bal | Prakash Poudyal
Fine-tuning a pre-trained language model is a technique which can be used to enhance the technologies of low-resourced languages. The unsupervised approach can fine-tune any pre-trained model with minimal or even no language-specific resources. It is highly advantageous, particularly for languages that possess limited computational resources. We present a novel approach for fine-tuning a pre-trained Automatic Speech Recognition (ASR) model that is suitable for low-resource languages. Our method involves iterative fine-tuning of a pre-trained ASR model. mms-1b is selected as the pre-trained seed model for fine-tuning. We take the Nepali language as a case study for this research work. Our approach achieved a CER of 6.77%, outperforming all previously recorded CER values for Nepali ASR systems.
Dispersed Hierarchical Attention Network for Machine Translation and Language Understanding on Long Documents with Linear Complexity
Ajay Mukund S. | Easwarakumar K.s.
With Transformers at the forefront of Natural Language Processing and pioneering recent developments, we tweak the very fundamentals of this giant Deep Learning model in this paper. For long documents, conventional Full Self-Attention exceeds the available compute power and memory as it scales quadratically. If we instead use Local Self-Attention with a sliding window, we lose the global context present in the input document, which can impact performance on the task at hand. For long documents (ranging from 500 to 16K tokens), the proposed Dispersed Hierarchical Attention component captures the local context using a sliding window and the global context using a linearly-scaled dispersion approach. This achieves O(N) linear complexity, where N is the length of the input sequence or document.
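To make the scaling contrast above concrete, the sketch below builds a generic sliding-window attention mask and counts the attended query-key pairs. It illustrates only the local sliding-window idea in its textbook form; it does not reproduce the paper's dispersion mechanism, and the names `sliding_window_mask` and `attended_pairs` are hypothetical.

```python
def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Assumes a symmetric window: each token sees itself and up to
    window // 2 neighbours on either side.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, window = 6, 3
    local = attended_pairs(sliding_window_mask(n, window))
    full = n * n  # full self-attention touches every pair
    print(f"local pairs: {local}, full pairs: {full}")
```

For a fixed window size the number of attended pairs is bounded by N times the window size, so it grows linearly in N, whereas full self-attention touches all N² pairs.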
Analyzing Sentiment Polarity Reduction in News Presentation through Contextual Perturbation and Large Language Models
Alapan Kuila | Somnath Jena | Sudeshna Sarkar | Partha Pratim Chakrabarti
In today’s media landscape, where news outlets play a pivotal role in shaping public opinion, it is imperative to address the issue of sentiment manipulation within news text. News writers often inject their own biases and emotional language, which can distort the objectivity of reporting. This paper introduces a novel approach to tackle this problem by reducing the polarity of latent sentiments in news content. Drawing inspiration from adversarial attack-based sentence perturbation techniques and a prompt-based method using ChatGPT, we employ transformation constraints to modify sentences while preserving their core semantics. Using three perturbation methods (replacement, insertion, and deletion) coupled with a context-aware masked language model, we aim to maximize the desired sentiment score for targeted news aspects through a beam search algorithm. Our experiments and human evaluations demonstrate the effectiveness of these two models in achieving reduced sentiment polarity with minimal modifications while maintaining textual similarity, fluency, and grammatical correctness. Comparative analysis confirms the competitive performance of the adversarial attack-based perturbation methods and prompt-based methods, offering a promising solution to foster more objective news reporting and combat emotional language bias in the media.
NLI to the Rescue: Mapping Entailment Classes to Hallucination Categories in Abstractive Summarization
Naveen Badathala | Ashita Saxena | Pushpak Bhattacharyya
In this paper, we detect hallucinations in summaries generated by abstractive summarization models. We focus on three types of hallucination, viz. intrinsic, extrinsic, and non-hallucinated. The method used for detecting hallucination is based on textual entailment. Given a premise and a hypothesis, textual entailment classifies the hypothesis as contradiction, neutral, or entailment. These three classes of textual entailment are mapped to intrinsic, extrinsic, and non-hallucinated, respectively. We fine-tune a RoBERTa-large model on NLI datasets and use it to detect hallucinations on the XSumFaith dataset. We demonstrate that our simple approach using textual entailment outperforms the existing factuality inconsistency detection systems by 12%, and we provide insightful analysis of all types of hallucination. To advance research in this area, we create and release a dataset, XSumFaith++, which contains balanced instances of hallucinated and non-hallucinated summaries.
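The class mapping stated in the abstract can be written down directly. The sketch below assumes an NLI classifier has already produced a label string for a (source document, summary) pair; the names `ENTAILMENT_TO_HALLUCINATION` and `hallucination_category` are illustrative, not from the paper, and the classifier itself is out of scope here.

```python
# Mapping as described in the abstract:
# contradiction -> intrinsic, neutral -> extrinsic, entailment -> non-hallucinated.
ENTAILMENT_TO_HALLUCINATION = {
    "contradiction": "intrinsic",
    "neutral": "extrinsic",
    "entailment": "non-hallucinated",
}


def hallucination_category(nli_label: str) -> str:
    """Translate an NLI label (hypothetical string input) into a category."""
    key = nli_label.strip().lower()
    if key not in ENTAILMENT_TO_HALLUCINATION:
        raise ValueError(f"unknown NLI label: {nli_label!r}")
    return ENTAILMENT_TO_HALLUCINATION[key]
```

In this framing the source document plays the role of the premise and the generated summary the role of the hypothesis.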
Text Detoxification as Style Transfer in English and Hindi
Sourabrata Mukherjee | Akanksha Bansal | Atul Kr. Ojha | John P. McCrae | Ondrej Dusek
This paper focuses on text detoxification, i.e., automatically converting toxic text into non-toxic text. This task contributes to safer and more respectful online communication and can be considered a Text Style Transfer (TST) task, where the text’s style changes while its content is preserved. We present three approaches: (i) knowledge transfer from a similar task, (ii) a multi-task learning approach, combining sequence-to-sequence modeling with various toxicity classification tasks, and (iii) a delete-and-reconstruct approach. To support our research, we utilize a dataset provided by Dementieva et al. (2021), which contains multiple versions of detoxified texts corresponding to toxic texts. In our experiments, we selected the best variants through expert human annotators, creating a dataset where each toxic sentence is paired with a single, appropriate detoxified version. Additionally, we introduced a small Hindi parallel dataset, aligned with a part of the English dataset, suitable for evaluation purposes. Our results demonstrate that our approach effectively balances text detoxification while preserving the actual content and maintaining fluency.
Hindi Causal TimeBank: an Annotated Causal Event Corpus
Tanvi Kamble | Manish Shrivastava
Events and states have gained importance in NLP and information retrieval for being semantically rich temporal and spatial information indicators. Event causality helps us identify which events are necessary for another event to occur. The cause-effect event pairs can be relevant for multiple NLP tasks like question answering, summarization, etc. Multiple efforts have been made to identify causal events in documents, but very little work has been done in this field in the Hindi language. We create an annotated corpus for detecting and classifying causal event relations on top of the Hindi Timebank (Goel et al., 2020), the ‘Hindi Causal Timebank’ (Hindi CTB). We introduce semantic causal relations like Purpose, Reason, and Enablement, inspired by Bejan and Harabagiu (2008)’s annotation scheme, and add some special cases particular to the Hindi language.
Enriching Electronic Health Record with Semantic Features Utilising Pretrained Transformers
Lena AlMutair | Eric Atwell | Nishant Ravikumar
Electronic Health Records (EHRs) have revolutionised healthcare by enhancing patient care and facilitating provider communication. Nevertheless, the efficient extraction of valuable information from EHRs poses challenges, primarily due to the overwhelming volume of unstructured data, the wide variability in data formats, and the lack of standardised labels. Leveraging deep learning and concept embeddings, we address the gap in context-aware systems for EHRs. The proposed solution was evaluated on the MIMIC III dataset and demonstrated superior performance compared to other methodologies. We addressed the positive impact of the latent feature combined with the note representation in four different settings. Model performance was evaluated using a case study conducted with BertScore, assessing precision, recall, and F1 scores. The model excels in Medical Natural Language Inference (MedNLI) with an 89.3% accuracy, further boosted to 90.5% through retraining the embeddings using International Classification of Diseases (ICD) codes, which we formally designate as ClinicNarrIR. ClinicNarrIR was tested with 1000 randomly sampled notes, achieving an NDCG@10 score of approximately 0.54 with accuracy@10 of 0.85. The study also demonstrates a high correlation between the results produced by the proposed representation and medical coders. Notably, in all evaluation cases, the optimal base pretrained model that emerged was BlueBERT.
Multilingual Multimodal Text Detection in Indo-Aryan Languages
Nihar Jyoti Basisth | Eisha Halder | Tushar Sachan | Advaitha Vetagiri | Partha Pakray
Multi-language text detection and recognition in complex visual scenes is an essential yet challenging task. Traditional pipelines relying on optical character recognition (OCR) often fail to generalize across different languages, fonts, orientations and imaging conditions. This work proposes a novel approach using the YOLOv5 object detection model architecture for multi-language text detection in images and videos. We curate and annotate a new dataset of over 4,000 scene text images across 4 Indian languages and use specialized data augmentation techniques to improve model robustness. Transfer learning from a base YOLOv5 model pretrained on COCO is combined with tailored optimization strategies for multi-language text detection. Our approach achieves state-of-the-art performance, with over 90% accuracy on multi-language text detection across all four languages in our test set. We demonstrate the effectiveness of fine-tuning YOLOv5 for generalized multi-language text extraction across diverse fonts, scales, orientations, and visual contexts. Our approach’s high accuracy and generalizability could enable numerous applications involving multilingual text processing from imagery and video.
Iterative Back Translation Revisited: An Experimental Investigation for Low-resource English Assamese Neural Machine Translation
Mazida Akhtara Ahmed | Kishore Kashyap | Kuwali Talukdar | Parvez Aziz Boruah
Back Translation has been an effective strategy to leverage monolingual data on both the source and target sides. Research has opened up several ways to improve the procedure; one among them is iterative back translation, where the monolingual data is repeatedly translated and used for re-training to enhance the model. Despite its success, iterative back translation remains relatively unexplored in low-resource scenarios, particularly for rich Indic languages. This paper presents a comprehensive investigation into the application of iterative back translation to the low-resource English-Assamese language pair. A simplified version of iterative back translation is presented. This study explores various critical aspects associated with back translation, including the balance between original and synthetic data and the refinement of the target (backward) model through retraining on cleaner data. The experimental results demonstrate significant improvements in translation quality. Specifically, the simplistic approach to iterative back translation yields a noteworthy +6.38 BLEU score improvement for the English-Assamese translation direction and a +4.38 BLEU score improvement for the Assamese-English translation direction. Further enhancements are observed when incorporating higher-quality, cleaner data for model retraining, highlighting the potential of iterative back translation as a valuable tool for enhancing low-resource neural machine translation (NMT).
Issues in the computational processing of Upamā alaṅkāra.
Bhakti Jadhav | Amruta Barbadikar | Amba Kulkarni | Malhar Kulkarni
Processing and understanding of figurative speech is a challenging task for computers as well as humans. In this paper, we present a case of Upamā alaṅkāra (simile). The verbal cognition of the Upamā alaṅkāra by a human is presented as a dependency tree, which involves the identification of various components such as upamāna (vehicle), upameya (topic), sādhāraṇa-dharma (common property) and upamādyotaka (word indicating similitude). This involves the repetition of elliptical elements. Further, we show how the same dependency tree may be represented without any loss of information, even without repetition of elliptical elements. Such a representation would be useful for the computational processing of the alaṅkāras.
Impacts of Approaches for Agglutinative-LRL Neural Machine Translation (NMT): A Case Study on Manipuri-English Pair
Gourashyam Moirangthem | Lavinia Nongbri | Samarendra Singh Salam | Kishorjit Nongmeikapam
Neural Machine Translation (NMT) is known to be extremely challenging for Low-Resource Languages (LRL) with complex morphology. This work deals with the NMT of a specific LRL called Manipuri/Meeteilon, which is a highly agglutinative language where words have extensive suffixation with limited prefixation. The work studies and discusses the impacts of approaches to mitigate the issues of NMT involving agglutinative LRLs in a strictly low-resource setting. The research work experimented with several methods and techniques, including subword tokenization, tuning of the self-attention-based NMT model, utilization of a monolingual corpus by iterative back translation, and embedding-based sentence filtering for back translation. This research work, in the strictly low-resource setting of only 21204 training sentences, showed remarkable results with a BLEU score of 28.17 for Manipuri to English translation.
KITLM: Domain-Specific Knowledge InTegration into Language Models for Question Answering
Ankush Agarwal | Sakharam Gawade | Amar Prakash Azad | Pushpak Bhattacharyya
Large language models (LLMs) have demonstrated remarkable performance in a wide range of natural language tasks. However, as these models continue to grow in size, they face significant challenges in terms of computational costs. Additionally, LLMs often lack efficient domain-specific understanding, which is particularly crucial in specialized fields such as aviation and healthcare. To boost the domain-specific understanding, we propose KITLM, a novel knowledge base integration approach into language models through relevant information infusion. By integrating pertinent knowledge, not only is the performance of the language model greatly enhanced, but the model size requirement is also significantly reduced while achieving comparable performance. Our proposed knowledge-infused model surpasses the performance of both GPT-3.5-turbo and the state-of-the-art knowledge infusion method, SKILL, achieving over 1.5 times improvement in exact match scores on MetaQA. KITLM showed a similar performance boost in the aviation domain with AeroQA. The drastic performance improvement of KITLM over the existing methods can be attributed to the infusion of relevant knowledge while mitigating noise. In addition, we release two curated datasets to accelerate knowledge infusion research in specialized fields: a) AeroQA, a new benchmark dataset designed for multi-hop question-answering within the aviation domain, and b) Aviation Corpus, a dataset constructed from unstructured text extracted from the National Transportation Safety Board reports. Our research contributes to advancing the field of domain-specific language understanding and showcases the potential of knowledge infusion techniques in improving performance.
Neural Machine Translation for a Low Resource Language Pair: English-Bodo
Parvez Aziz Boruah | Kuwali Talukdar | Mazida Akhtara Ahmed | Kishore Kashyap
This paper presents work on Neural Machine Translation for the English-Bodo language pair. English is a language spoken around the world, whereas Bodo is a language mostly spoken in the North Eastern area of India. This machine translation work is done on a relatively small amount of parallel data, as little parallel corpus is available for the English-Bodo pair. The corpus is taken from available sources, namely the National Platform of Language Technology (NPLT), Data Management Unit (DMU), Mission Bhashini, Ministry of Electronics and Information Technology, and is also generated in-house. Tokenization of raw text is done using the IndicNLP library and mosesdecoder for Bodo and English, respectively. Subword tokenization is performed using BPE (Byte Pair Encoding), SentencePiece and WordPiece. Experiments have been done with two different vocabulary sizes, 8000 and 16000, on a total of around 92410 parallel sentences. Two standard transformer encoder-decoder models with varying numbers of layers and hidden sizes are built for training on the data using the OpenNMT-py framework. The results are evaluated based on the BLEU score on an additional test set. The highest BLEU scores of 11.01 and 14.62 are achieved on the test set for English to Bodo and Bodo to English translation, respectively.
Bi-Quantum Long Short-Term Memory for Part-of-Speech Tagging
Shyambabu Pandey | Partha Pakray
Natural language processing (NLP) is a subfield of artificial intelligence that enables computer systems to understand and generate human language. NLP tasks involve machine learning and deep learning methods for processing the data. Traditional applications utilize massive datasets and resources to perform NLP tasks, which is challenging for classical systems. On the other hand, quantum computing has emerged as a promising technology with the potential to address certain computational problems more efficiently than classical computing in specific domains. In recent years, researchers have started exploring the application of quantum computing techniques to NLP tasks. In this paper, we propose a quantum-based deep learning model, Bi-Quantum long short-term memory (BiQLSTM). We apply the proposed model to POS tagging on social media code-mixed datasets.
Sentiment Analysis for the Mizo Language: A Comparative Study of Classical Machine Learning and Transfer Learning Approaches
Mercy Lalthangmawii | Thoudam Doren Singh
Sentiment analysis, a subfield of natural language processing (NLP), has witnessed significant advancements in the analysis of user-generated content across diverse languages. However, its application to low-resource languages remains a challenge. This research addresses this gap by conducting a comprehensive sentiment analysis experiment in the context of the Mizo language, a low-resource language predominantly spoken in the Indian state of Mizoram and neighboring regions. Our study encompasses the evaluation of various machine learning models including Support Vector Machine (SVM), Decision Tree, Random Forest, K-Nearest Neighbor (K-NN), Logistic Regression and transfer learning using XLM-RoBERTa. The findings reveal the suitability of SVM as a robust performer in Mizo sentiment analysis, demonstrating the highest F1 Score and Accuracy among the models tested. XLM-RoBERTa, a transfer learning model, exhibits competitive performance, highlighting the potential of leveraging pre-trained multilingual models in low-resource language sentiment analysis tasks. This research advances our understanding of sentiment analysis in low-resource languages and serves as a stepping stone for future investigations in this domain.
Bidirectional Neural Machine Translation (NMT) using Monolingual Data for Khasi-English Pair
Lavinia Nongbri | Gourashyam Moirangthem | Samarendra Salam | Kishorjit Nongmeikapam
Due to a lack of parallel data, low-resource language machine translation has been unable to make the most of Neural Machine Translation. This paper investigates several approaches to improving low-resource Neural Machine Translation in a strictly low-resource setting, specifically for the bidirectional Khasi-English language pair. The back-translation method is used to expand the parallel corpus using monolingual data. The work also experimented with subword tokenizers to improve the translation accuracy for new and rare words. Transformer, a cutting-edge NMT model, serves as the backbone of the bidirectional Khasi-English machine translation. The final Khasi-to-English and English-to-Khasi NMT models trained using both authentic and synthetic parallel corpora show increases of 2.34 and 3.1 BLEU points, respectively, when compared to the models trained using only the authentic parallel dataset.
Lost in Translation No More: Fine-tuned transformer-based models for CodeMix to English Machine Translation
Arindam Chatterjee | Chhavi Sharma | Yashwanth V.p. | Niraj Kumar | Ayush Raj | Asif Ekbal
Codemixing, the linguistic phenomenon where a speaker alternates between two or more languages within a conversation or even a single utterance, presents a significant challenge for machine translation systems due to its syntactic complexity and contextual nuances. This paper introduces a set of advanced transformer-based models fine-tuned specifically for translating codemixed text to English, more specifically, Hindi-English (colloquially referred to as Hinglish) codemixed text into English. Unlike standard bilingual corpora, codemixed data requires an understanding of the intricacies of grammatical structures and cultural contexts embedded within the language blend. Existing machine translation efforts in codemixed languages have largely been constrained by the paucity of robust datasets and models that can capture the nuanced semantic and syntactic interplay characteristic of such languages. We present a novel dataset PACMAN trans for Hinglish to English machine translation, based on the PACMAN strategy, meticulously curated to represent natural codemixing patterns. Our generic fine-tuned translation models trained on the novel data outperform current state-of-the-art Large Language Models (LLMs) by 38% in terms of BLEU score. Further, when fine-tuned on custom benchmark datasets, our focused dual fine-tuned models surpass the PHINC dataset BLEU score benchmark by 22%. Our comparative analysis illustrates significant improvements in translation quality, showcasing the potential of fine-tuning transformer models in bridging the linguistic divide in codemixed language translation. The success of our models reflects a promising step forward in the quest to provide seamless translation services for the ever-growing multilingual population and the complex linguistic phenomena they generate.
pdf
abs
Automated System for Opinion Detection of Breathing Problem Discussions in Medical Forum Using Deep Neural Network
Somenath Nag Choudhury
|
Asif Ekbal
Chest X-ray radiology majorly focuses on diseases like consolidation, pneumothorax, pleural effusion, lung collapse, etc., causing breathing and circulation problems. A tendency to share such problems in the forums for an answer without revealing personal demographics is also very common. However, we have observed more visitors than authors, which leads to a very poor average reply count per discussion (3 to 12 only), and many discussions are left with no or late replies in the forums. To ease the process of acquiring the best replies from multiple discussions, we propose a supervised learning framework that automatically scrapes and annotates breathing problem-related group discussions from the patient.info forum and determines the associated sentiment of the most voted respondent post using Bi-LSTM. We assume the most voted reply is the most factual and experienced. We mainly scraped and determined the sentiment of bronchiectasis, asthma, pneumonia, and respiratory disease-related posts. After filtering and augmentation, a total of 1,748 posts were used for training our Stacked Bi-LSTM model, which achieved an overall accuracy of 90%.
pdf
abs
Effect of Pivot Language and Segment-Based Few-Shot Prompting for Cross-Domain Multi-Intent Identification in Low Resource Languages
Kathakali Mitra
|
Aditha Venkata Santosh Ashish
|
Soumya Teotia
|
Aruna Malapati
NLU (Natural Language Understanding) has considerable difficulties in identifying multiple intentions across different domains in languages with limited resources. Our contributions involve utilizing pivot languages with similar semantics for NLU tasks, creating a vector database for efficient retrieval and indexing of language embeddings in high-resource languages for Retrieval Augmented Generation (RAG) in low-resource languages, and thoroughly investigating the effect of segment-based strategies on complex user utterances across multiple domains and intents in the development of a Chain of Thought Prompting (COT) combined with Retrieval Augmented Generation. The study investigated recursive approaches to identify the most effective zero-shot instances for segment-based prompting. A comparison analysis was conducted to compare the effectiveness of sentence-based prompting vs segment-based prompting across different domains and multiple intents. This research offers a promising avenue to address the formidable challenges of NLU in low-resource languages, with potential applications in conversational agents and dialogue systems and a broader impact on linguistic understanding and inclusivity.
pdf
abs
Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Language
Vandan Mujadia
|
Pruthwik Mishra
|
Arafat Ahsan
|
Dipti M. Sharma
With the primary focus on evaluating the effectiveness of large language models for automatic reference-less translation assessment, this work presents our experiments on mimicking human direct assessment to evaluate the quality of translations in English and Indian languages. We constructed a translation evaluation task where we performed zero-shot learning, in-context example-driven learning, and fine-tuning of large language models to provide a score out of 100, where 100 represents a perfect translation and 1 represents a poor translation. We compared the performance of our trained systems with existing methods such as COMET, BERT-Scorer, and LABSE, and found that the LLM-based evaluator (LLaMA2-13B) achieves a comparable or higher overall correlation with human judgments for the considered Indian language pairs (Refer figure 1).
pdf
abs
1-step Speech Understanding and Transcription Using CTC Loss
Karan Singla
|
Shahab Jalalv
|
Yeon-Jun Kim
|
Andrej Ljolje
|
Antonio Moreno Daniel
|
Srinivas Bangalore
|
Benjamin Stern
Recent studies have made some progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. We propose 1SPU, a 1-step Speech Processing Unit which can recognize speech events (e.g., speaker change) or an NL event (Intent, Emotion) while also transcribing vocal content. It extends the E2E automatic speech recognition (ASR) system’s vocabulary by adding a set of unused placeholder symbols, conceptually akin to the <pad> tokens used in sequence modeling. These placeholders are then assigned to represent semantic events (in the form of tags) and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements on the SLUE benchmark and obtain results that are on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system’s proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the utilization of supplementary semantic tags.
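The placeholder idea in the abstract above can be illustrated with a minimal sketch. It assumes a character-level vocabulary; the helper names (`build_vocab`, `assign_tags`, `encode`), the `<unusedK>` token spelling, and the tag strings are hypothetical and are not taken from the 1SPU paper.

```python
def build_vocab(chars, num_placeholders):
    """Character vocabulary followed by reserved, initially unused placeholder tokens."""
    vocab = {ch: i for i, ch in enumerate(chars)}
    free = []
    for k in range(num_placeholders):
        token = f"<unused{k}>"  # hypothetical spelling for a reserved symbol
        vocab[token] = len(vocab)
        free.append(token)
    return vocab, free


def assign_tags(free_placeholders, tags):
    """Map each semantic tag onto one reserved placeholder token."""
    if len(tags) > len(free_placeholders):
        raise ValueError("not enough placeholder tokens for the requested tags")
    return dict(zip(tags, free_placeholders))


def encode(sequence, vocab, tag_to_placeholder):
    """Encode text pieces character by character; each tag becomes one distinct token id."""
    ids = []
    for item in sequence:
        if item in tag_to_placeholder:
            ids.append(vocab[tag_to_placeholder[item]])
        else:
            ids.extend(vocab[ch] for ch in item)
    return ids


vocab, free = build_vocab("abc ", num_placeholders=2)
tag_map = assign_tags(free, ["[INTENT]", "[SPEAKER_CHANGE]"])
target = encode(["ab", "[INTENT]", " ", "c"], vocab, tag_map)
```

Because the tags occupy ordinary vocabulary slots, a target sequence built this way could be passed to a CTC-style loss unchanged; how the actual system trains on such targets is not shown here.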
pdf
abs
Consolidating Strategies for Countering Hate Speech Using Persuasive Dialogues
Sougata Saha
|
Rohini Srihari
Hateful comments are prevalent on social media platforms. Although tools for automatically detecting, flagging, and blocking such false, offensive, and harmful content online have lately matured, such reactive and brute force methods alone provide short-term and superficial remedies while the perpetrators persist. With the public availability of large language models which can generate articulate synthetic and engaging content at scale, there are concerns about the rapid growth of dissemination of such malicious content on the web. There is now a need to focus on deeper, long-term solutions that involve engaging with the human perpetrator behind the source of the content to change their viewpoint or at least bring down the rhetoric using persuasive means. To do that, we propose defining and experimenting with controllable strategies for generating counterarguments to hateful comments in online conversations. We experiment with controlling response generation using features based on (i) argument structure and reasoning-based Walton argument schemes, (ii) counter-argument speech acts, and (iii) human characteristics-based qualities such as Big-5 personality traits and human values. Using automatic and human evaluations, we determine the best combination of features that generate fluent, argumentative, and logically sound arguments for countering hate. We further share the developed computational models for automatically annotating text with such features, and a silver-standard annotated version of an existing hate speech dialog corpus.
pdf
abs
Konkani ASR
Swapnil Fadte
|
Gaurish Thakkar
|
Jyoti D. Pawar
Konkani is a resource-scarce language, mainly spoken on the west coast of India. The lack of resources directly impacts the development of language technology tools and services. Therefore, the development of digital resources is required to aid in the improvement of this situation. This paper describes the work on the Automatic Speech Recognition (ASR) System for the Konkani language. We have created the ASR by fine-tuning the whisper-small ASR model with 100 hours of Konkani speech corpus data. The baseline model showed a word error rate (WER) of 17, which serves as evidence for the efficacy of the fine-tuning procedure in establishing ASR accuracy for the Konkani language.
pdf
abs
Query-Based Summarization and Sentiment Analysis for Indian Financial Text by leveraging Dense Passage Retriever, RoBERTa, and FinBERT
Numair Shaikh
|
Jayesh Patil
|
Sheetal Sonawane
With the ever-expanding pool of information accessible on the Internet, it has become increasingly challenging for readers to sift through voluminous data and derive meaningful insights. This is particularly noteworthy and critical in the context of documents such as financial reports and large-scale media reports. In the realm of finance, documents are typically lengthy and comprise numerical values. This research delves into the extraction of insights through text summaries from financial data, based on the user’s interests, and the identification of clues from these insights. This research presents a straightforward, all-encompassing framework for conducting query-based summarization of financial documents, as well as analyzing the sentiment of the summary. The system’s performance is evaluated using benchmarked metrics, and it is compared to State-of-The-Art (SoTA) algorithms. Extensive experimentation indicates that the proposed system surpasses existing pre-trained language models.
pdf
abs
Bias Detection Using Textual Representation of Multimedia Contents
Karthik L. Nagar
|
Aditya Mohan Singh
|
Sowmya Rasipuram
|
Roshni Ramnani
|
Milind Savagaonkar
|
Anutosh Maitra
The presence of biased and prejudicial content in social media has become a pressing concern, given its potential to inflict severe societal damage. Detecting and addressing such bias is imperative, as the rapid dissemination of skewed content has the capacity to disrupt social harmony. Advanced deep learning models are now paving the way for the automatic detection of bias in multimedia content with human-like accuracy. This paper focuses on identifying social bias in social media images. Toward this, we curated a Social Bias Image Dataset (SBID), consisting of 300 bias/no-bias images. The images contain both textual and visual information. We scientifically annotated the dataset for four different categories of bias. Our methodology involves generating a textual representation of the image content leveraging state-of-the-art models of optical character recognition (OCR), image captioning, and character attribute extraction. Initially, we performed fine-tuning on a Bidirectional Encoder Representations from Transformers (BERT) network to classify bias and no-bias, as well as on a Bidirectional AutoRegressive Transformer (BART) network for bias categorization, utilizing an extensive textual corpus. Further, these networks were fine-tuned on SBID, the image dataset we built. The experimental findings presented herein underscore the effectiveness of these models in identifying various forms of bias in social media images. We also demonstrate their capacity to discern both explicit and implicit bias.
pdf
abs
Annotated and Normalized Causal Relation Extraction Corpus for Improving Health Informatics
Samridhi Dev
|
Aditi Sharan
In the ever-expanding landscape of biomedical research, development of new cancer drugs has increased the likelihood of adverse drug reactions (ADRs). However, information about these ADRs is often buried in unstructured data, requiring the conversion of this data into a structured and labeled dataset to identify potential ADRs and associations between them, making the extraction of entities and the analysis of causal relations a pivotal task. Machine learning methods have been used to identify ADRs, but current literature has several gaps in coverage, superficial manual annotation, and a lack of a labeled ADR corpus specific to cancer and normalized entities. Current datasets are generated manually on the abstracts, limiting their scope. To address these limitations, the paper presents an algorithm that automatically constructs, annotates, normalizes entities specific to cancer and identifies causal relationships among entities using linguistics and grammatical properties, MetaMap and UMLS tools enabling efficient information retrieval. A further knowledge graph was created for a case report to visualize the causal relationships.
pdf
abs
T20NGD: Annotated corpus for news headlines classification in low resource language, Telugu.
Chindukuri Mallikarjuna
|
Sangeetha Sivanesan
News classification allows analysts and researchers to study trends over time. Based on classification, news platforms can provide readers with related articles. Many digital news platforms and apps use classification to offer personalized content for their users. While there are numerous resources accessible for news classification in various Indian languages, there is still a lack of an extensive benchmark dataset specifically for the Telugu language. Our paper presents and describes the Telugu20news group dataset, where news has been collected from various online Telugu news channels. We describe in detail the accumulation and annotation of the proposed news headlines dataset. In addition, we conducted extensive experiments on our proposed news headlines dataset in order to deliver solid baselines for future work.
pdf
abs
Advancing Class Diagram Extraction from Requirement Text: A Transformer-Based Approach
Shweta X
|
Suyash Mittal
|
Suryansh Chauhan
The class diagram plays an important role in software development. As these diagrams are created using software requirement text, it helps to improve communication between the developers and the stakeholders. Thus, the automatic extraction of class diagrams enhances the speed of software development procedures. The research carried out in this direction mostly relies on rule-based methodologies and deep learning models. These methodologies have their drawbacks, such as the fact that large rule-based systems are complex to handle, whereas the word embeddings used in deep learning models are context-independent. Thus, the presented research work strives to extract the class diagram entities from the natural language text by employing a transformer-based model, as the embeddings generated by these models are context-dependent. The results have been compared with the existing procedure, and an ablation study has also been carried out to find out the relevance of each step in the extraction procedure. The analysis involved examining the true positive, false positive, and false negative rates for specific class diagram elements in separate case studies. As a result, an enhancement of 9–7% has been observed in the procedures used for extracting the resulting class diagrams.
pdf
abs
L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic languages
Aishwarya Mirashi
|
Srushti Sonavane
|
Purva Lingayat
|
Tejas Padhiyar
|
Raviraj Joshi
In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp.
pdf
abs
PoS to UPoS Conversion and Creation of UPoS Tagged Resources for Assamese Language
Kuwali Talukdar
|
Shikhar Kumar Sarma
This paper addresses the vital task of transitioning from traditional Part-of-Speech (PoS) tagging to Universal Part-of-Speech (UPoS) tagging within the linguistic context of the Assamese language. The paper outlines a comprehensive methodology for PoS to UPoS conversion and the creation of UPoS tagged resources, bridging the gap between localized linguistic analysis and universal standards. The significance of this work lies in its potential to enhance natural language processing and understanding for the Assamese language, contributing to broader multilingual applications. The paper details the data preparation and creation processes, annotation methods, and evaluation techniques, shedding light on the challenges and opportunities presented in the pursuit of linguistic universality. The contents of this research have implications for improving language technology in the Assamese language and can serve as a model for similar work in other regional languages. Mapping of the standard PoS tagset applicable for Indian languages to the primary categories of the UPoS tagset is done with respect to the lexical behaviour of the Assamese language. Conversion of a PoS tagged text corpus to a UPoS tagged corpus using this mapping, and then utilizing a Deep Learning based model trained on such a dataset to create a sizable UPoS tagged corpus, are presented in a structured flow. This paper is a step towards a more standardized, universal understanding of linguistic elements in a diverse and multilingual world.
pdf
abs
Mitigating Abusive Comment Detection in Tamil Text: A Data Augmentation Approach with Transformer Model
Reshma Sheik
|
Raghavan Balanathan
|
Jaya Nirmala S.
With the increasing number of users on social media platforms, the detection and categorization of abusive comments have become crucial, necessitating effective strategies to mitigate their impact on online discussions. However, the intricate and diverse nature of low-resource Indic languages presents a challenge in developing reliable detection methodologies. This research focuses on the task of classifying YouTube comments written in Tamil language into various categories. To achieve this, our research conducted experiments utilizing various multi-lingual transformer-based models along with data augmentation approaches involving back translation approaches and other pre-processing techniques. Our work provides valuable insights into the effectiveness of various preprocessing methods for this classification task. Our experiments showed that the Multilingual Representations for Indian Languages (MURIL) transformer model, coupled with round-trip translation and lexical replacement, yielded the most promising results, showcasing a significant improvement of over 15 units in macro F1-score compared to existing baselines. This contribution adds to the ongoing research to mitigate the adverse impact of abusive content on online platforms, emphasizing the utilization of diverse preprocessing strategies and state-of-the-art language models.
pdf
abs
Dravidian Fake News Detection with Gradient Accumulation based Transformer Model
Eduri Raja
|
Badal Soni
|
Samir Kumar Borgohain
|
Candy Lalrempuii
The proliferation of fake news poses a significant challenge in the digital era. Detecting false information, especially in non-English languages, is crucial to combating misinformation effectively. In this research, we introduce a novel approach for Dravidian fake news detection by harnessing the capabilities of the MuRIL transformer model, further enhanced by gradient accumulation techniques. Our study focuses on the Dravidian languages, a diverse group of languages spoken in South India, which are often underserved in natural language processing research. We optimize memory usage, stabilize training, and improve the model’s overall performance by accumulating gradients over multiple batches. The proposed model exhibits promising results in terms of both accuracy and efficiency. Our findings underline the significance of adapting state-of-the-art techniques, such as MuRIL-based models and gradient accumulation, to non-English languages to address the pressing issue of fake news.
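Gradient accumulation, as mentioned in the abstract above, can be sketched on a toy problem. The example below assumes a one-parameter linear model with mean-squared-error loss and equal-sized micro-batches; the function names and data are hypothetical and this is not the authors' MuRIL training code.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)


def accumulated_step(w, data, micro_batch_size, lr):
    """One optimizer step whose gradient is accumulated over several micro-batches."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    accumulated = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient so the sum equals the large-batch mean.
        accumulated += grad_mse(w, micro_batch) / len(micro_batches)
    return w - lr * accumulated  # the single update happens only after all micro-batches


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1),
        (5.0, 9.8), (6.0, 12.2), (7.0, 13.9), (8.0, 16.1)]
w0 = 0.5
w_accumulated = accumulated_step(w0, data, micro_batch_size=2, lr=0.01)
w_full_batch = w0 - 0.01 * grad_mse(w0, data)
```

With equal-sized micro-batches the accumulated update matches the full-batch update up to floating-point error, while only one micro-batch needs to be held in memory at a time, which is the usual motivation for the technique.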
pdf
abs
Automatic Speech Recognition System for Malasar Language using Multilingual Transfer Learning
Basil K. Raju
|
Leena G. Pillai
|
Kavya Manohar
|
Elizabeth Sherly
This study pioneers the development of an automatic speech recognition (ASR) system for the Malasar language, an extremely low-resource ethnic language spoken by a tribal community in the Western Ghats of South India. Malasar is primarily an oral language which does not have a native script. Therefore, Malasar is often transcribed in Tamil script, a closely related major language. This work presents the first ever effort of leveraging the capabilities of multilingual transfer learning for recognising Malasar speech. We fine-tune a pre-trained multilingual transformer model with Malasar speech data. In our endeavour to fine-tune this model using a Malasar speech corpus, we could successfully bring down the WER to 48.00% from 99.08% (zero shot baseline). This work demonstrates the efficacy of multilingual transfer learning in addressing the challenges of ASR for extremely low-resource languages, contributing to the preservation of their linguistic and cultural heritage.
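For readers unfamiliar with the metric, word error rate (WER) is commonly computed as the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below follows that standard definition; the function name and the example strings are hypothetical, and the paper's own scoring tool may normalise text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


example_wer = word_error_rate("a b c d", "a x c")  # one substitution, one deletion

# Relative WER reduction implied by the figures quoted in the abstract.
relative_reduction = (99.08 - 48.00) / 99.08
```

On the figures quoted above, the drop from 99.08% to 48.00% corresponds to a relative WER reduction of roughly 51.6%.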
pdf
abs
Dy-poThon: A Bangla Sentence-Learning System for Children with Dyslexia
Dipshikha Podder
|
Manjira Sinha
|
Tirthankar Dasgupta
|
Anupam Basu
The number of assistive technologies available for dyslexia in Bangla is low and most of them do not use multisensory teaching methods. As a solution, a computer-based audio-visual system Dy-poThon is proposed to teach sentence reading in Bangla. It incorporates the multisensory teaching method through three activities, listening, reading, and writing, checks the reading and writing ability of the user and tracks the response time. A criteria-based evaluation was conducted with 28 special educators to evaluate Dy-poThon. Content, efficiency, ease of use and aesthetics are evaluated using a standardised questionnaire. The result suggests that Dy-poThon is useful for teaching Bangla sentence-reading.
pdf
abs
Mitigating Clickbait: An Approach to Spoiler Generation Using Multitask Learning
Sayantan Pal
|
Souvik Das
|
Rohini K. Srihari
pdf
abs
Comparing DAE-based and MASS-based UNMT: Robustness to Word-Order Divergence in English–>Indic Language Pairs
Tamali Banerjee
|
Rudra Murthy
|
Pushpak Bhattacharyya
pdf
abs
MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering
Ruturaj Ghatage
|
Aditya Ashutosh Kulkarni
|
Rajlaxmi Patil
|
Sharvi Endait
|
Raviraj Joshi
Question-answering systems have revolutionized information retrieval, but linguistic and cultural boundaries limit their widespread accessibility. This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages by translating the English Question Answering Dataset (SQuAD) using a robust data curation approach. We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples. Challenges in maintaining context and handling linguistic nuances are addressed, ensuring accurate translations. Moreover, as a QnA dataset cannot be simply converted into any low-resource language using translation, we need a robust method to map the answer translation to its span in the translated passage. Hence, to address this challenge, we also present a generic approach for translating SQuAD into any low-resource language. Thus, we offer a scalable approach to bridge linguistic and cultural gaps present in low-resource languages, in the realm of question-answering systems. The datasets and models are shared publicly at https://github.com/l3cube-pune/MarathiNLP.
pdf
abs
CASM - Context and Something More in Lexical Simplification
Atharva Kumbhar
|
Sheetal Sonawane
|
Dipali Kadam
|
Prathamesh Mulay
Lexical Simplification is a challenging task that aims to improve the readability of text for non-native speakers, people with dyslexia, and people with other linguistic impairments. It consists of 3 components: 1) Complex Word Identification 2) Substitute Generation 3) Substitute Ranking. Current methods use contextual information as a primary source in all three stages of the simplification pipeline. We argue that while context is an important measure, it alone is not sufficient in the process. In the complex word identification step, contextual information is inadequate; moreover, heavy feature engineering is required to use additional linguistic features. This paper presents a novel architecture for complex word identification that uses a pre-trained transformer model’s information flow through its hidden layers as a feature representation that implicitly encodes all the features required for identification. We portray how database methods and masked language modeling can be complementary to one another in a substitute generation and ranking process that is built on the foundational pillars of simplicity, grammatical and semantic correctness, and context preservation. We show that our proposed model generalizes well and outperforms the current state-of-the-art on well-known datasets.
pdf
abs
Improving the Evaluation of NLP Approaches for Scientific Text Annotation with Ontology Embedding-Based Semantic Similarity Metrics
Pratik Devkota
|
Somya D. Mohanty
|
Prashanti Manda
pdf
abs
A Survey of using Large Language Models for Generating Infrastructure as Code
Kalahasti Ganesh Srivatsa
|
Sabyasachi Mukhopadhyay
|
Ganesh Katrapati
|
Manish Shrivastava
Infrastructure as Code (IaC) is a revolutionary approach which has gained significant prominence in the Industry. IaC manages and provisions IT infrastructure using machine-readable code by enabling automation, consistency across the environments, reproducibility, version control, error reduction and enhancement in scalability. However, IaC orchestration is often a painstaking effort which requires specialised skills as well as a lot of manual effort. Automation of IaC is a necessity in the present conditions of the Industry and in this survey, we study the feasibility of applying Large Language Models (LLM) to address this problem. LLMs are large neural network-based models which have demonstrated significant language processing abilities and have been shown to be capable of following a range of instructions within a broad scope. Recently, they have also been adapted for code understanding and generation tasks successfully, which makes them a promising choice for the automatic generation of IaC configurations. In this survey, we delve into the details of IaC, usage of IaC in different platforms, their challenges, LLMs in terms of code-generation aspects and the importance of LLMs in IaC along with our own experiments. Finally, we conclude by presenting the challenges in this area and highlighting the scope for future research.
pdf
abs
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India’s Very Low-Resource Languages
Atnafu Lambebo Tonja
|
Melkamu Mersha
|
Ananya Kalita
|
Olga Kolesnikova
|
Jugal Kalita
This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India. It also presents the results of initial translation efforts in these languages. It creates the first-ever parallel corpora for these languages and provides initial benchmark neural machine translation results for these languages. We intend to extend these corpora to include a large number of low-resource Indian languages and integrate the effort with our prior work with African and American-Indian languages to create corpora covering a large number of languages from across the world.
pdf
abs
Kurosawa: A Script Writer’s Assistant
Prerak Gandhi
|
Vishal Pramanik
|
Pushpak Bhattacharyya
Storytelling is the lifeline of the entertainment industry: movies, TV shows, and stand-up comedies all need stories. A good and gripping script is the lifeline of storytelling and demands creativity and resource investment. Good scriptwriters are rare to find and often work under severe time pressure. Consequently, entertainment media are actively looking for automation. In this paper, we present an AI-based script-writing workbench called KUROSAWA which addresses the tasks of plot generation and script generation. Plot generation aims to generate a coherent and creative plot (600–800 words) given a prompt (15–40 words). Script generation, on the other hand, generates a scene (200–500 words) in a screenplay format from a brief description (15–40 words). Kurosawa needs data to train. We use a 4-act structure of storytelling to annotate the plot dataset manually. We create a dataset of 1000 manually annotated plots and their corresponding prompts/storylines and a gold-standard dataset of 1000 scenes with four main elements (scene headings, action lines, dialogues, and character names) tagged individually. We fine-tune GPT-3 with the above datasets to generate plots and scenes. These plots and scenes are first evaluated and then used by the scriptwriters of a large and famous media platform, ErosNow. We release the annotated datasets and the models trained on these datasets as a working benchmark for automatic movie plot and script generation.
pdf
abs
Text-2-Wiki: Summarization and Template-driven Article Generation
Jayant Panwar
|
Radhika Mamidi
Users on Wikipedia collaborate in a structured and organized manner to publish and update articles on numerous topics, which makes Wikipedia a very rich source of knowledge. English Wikipedia has the most information available (more than 6.7 million articles); however, there are few good informative articles on Wikipedia in Indian languages. Hindi Wikipedia has only about 160k articles. The same article in Hindi can be vastly different from its English version and generally contains less information. This poses a problem for native Indian language speakers who are not proficient in English. Therefore, having the same amount of information in Indian languages will help promote knowledge among those who are not well-versed in English. Publishing the articles manually, as is usual on the global English Wikipedia, is a time-consuming process. To bring the amount of information in native Indian languages up to speed with the amount of information in English, automating the whole article generation process is the best option. In this study, we present a stage-wise approach ranging from Data Collection to Summarization and Translation, and finally ending with Template Creation. This approach ensures the efficient generation of a large amount of content in Hindi Wikipedia in less time. With the help of this study, we were able to successfully generate more than a thousand articles in Hindi Wikipedia with ease.
pdf
abs
Blind Leading the Blind: A Social-Media Analysis of the Tech Industry
Tanishq Chaudhary
|
Pulak Malhotra
|
Radhika Mamidi
|
Ponnurangam Kumaraguru
Online social networks (OSNs) have changed the way we perceive careers. A standard screening process for employees now involves profile checks on LinkedIn, X, and other platforms, with any negative opinions scrutinized. Blind, an anonymous social networking platform, aims to satisfy this growing need for taboo workplace discourse. In this paper, for the first time, we present a large-scale empirical text-based analysis of the Blind platform. We acquire and release two novel datasets: 63k Blind Company Reviews and 767k Blind Posts, containing over seven years of industry data. Using these, we analyze the Blind network, study drivers of engagement, and obtain insights into the last eventful years, preceding, during, and post-COVID-19, accounting for the modern phenomena of work-from-home, return-to-office, and the layoffs surrounding the crisis. Finally, we leverage the unique richness of the Blind content and propose a novel content classification pipeline to automatically retrieve and annotate relevant career and industry content across other platforms. We achieve an accuracy of 99.25% for filtering out relevant content, 78.41% for fine-grained annotation, and 98.29% for opinion mining, demonstrating the high practicality of our software.
pdf
abs
A Unified Multi task Learning Architecture for Hate Detection Leveraging User-based Information
Prashant Kapil
|
Asif Ekbal
Hate speech, offensive language, aggression, racism, sexism, and other abusive language are common phenomena in social media. There is a need for Artificial Intelligence (AI) based intervention which can filter hate content at scale. Most existing hate speech detection solutions have utilized the features by treating each post as an isolated input instance for the classification. This paper addresses this issue by introducing a unique model that improves hate speech identification for the English language by utilising intra-user and inter-user-based information. The experiment is conducted over single-task learning (STL) and multi-task learning (MTL) paradigms that use deep neural networks, such as convolutional neural network (CNN), gated recurrent unit (GRU), bidirectional encoder representations from the transformer (BERT), and A Lite BERT (ALBERT). We use three benchmark datasets and conclude that combining certain user features with textual features gives significant improvements in macro-F1 and weighted-F1.
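The two metrics named in this abstract differ only in how per-class F1 scores are averaged. The following is a minimal sketch of that difference, not the authors' evaluation code; the function name `f1_scores` and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true instances per class
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro-F1: unweighted mean over classes, so rare classes count equally.
    macro = sum(per_class.values()) / len(labels)
    # Weighted-F1: mean over classes weighted by each class's support.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted


# Toy, made-up labels (hypothetical, not from the paper's datasets).
y_true = ["hate", "none", "none", "none"]
y_pred = ["hate", "hate", "none", "none"]
macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(round(macro_f1, 4), round(weighted_f1, 4))
```

On imbalanced hate speech data the two numbers can diverge noticeably, which is presumably why both are reported.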
pdf
abs
Mytho-Annotator: An Annotation tool for Indian Hindu Mythology
Apurba Paul
|
Anupam Mondal
|
Sainik Kumar Mahata
|
Srijan Seal
|
Prasun Sarkar
|
Dipankar Das
Mythology is a collection of myths, especially one belonging to a particular religious or cultural tradition. We observed that an annotation tool is essential to identify important and complex information from any mythological texts or corpora. Additionally, obtaining high-quality annotated corpora for complex information extraction, including labeled text segments, is an expensive and time-consuming process. Hence, in this paper, we have designed and deployed an annotation tool for Hindu mythology, which is presented as Mytho-Annotator. This easy-to-use web-based text annotation tool is powered by Natural Language Processing (NLP). The tool primarily labels three categories: named entities, relationships, and event entities. This annotation tool offers a comprehensive and adaptable annotation paradigm.
pdf
abs
Transformer-based Bengali Textual Emotion Recognition
Md. Atabuzzaman
|
Mst Maksuda Bilkis Baby
|
Md. Shajalal
Emotion recognition for high-resource languages has progressed significantly. However, resource-constrained languages such as Bengali have not advanced notably due to the lack of large benchmark datasets. Besides this, the shortage of Bengali language processing tools makes the emotion recognition task more challenging and complicated. Therefore, in this paper, we developed the largest such dataset, consisting of almost 12k Bengali texts with six basic emotions. Then, we conducted experiments on our dataset to establish the baseline performance, applying machine learning, deep learning, and transformer-based models as emotion classifiers. The experimental results demonstrate that the models achieved promising performance in Bengali emotion recognition.
pdf
abs
Citation-Based Summarization of Landmark Judgments
Purnima Bindal
|
Vikas Kumar
|
Vasudha Bhatnagar
|
Parikshet Sirohi
|
Ashwini Siwal
Landmark judgments are of prime importance in the Common Law System because of their exceptional jurisprudence and frequent references in other judgments. In this work, we leverage contextual references available in citing judgments to create an extractive summary of the target judgment. We evaluate the proposed algorithm on two datasets curated from the judgments of Indian Courts and find the results promising.
pdf
abs
Aspect and Opinion Term Extraction Using Graph Attention Network
Abir Chakraborty
In this work we investigate the capability of Graph Attention Network for extracting aspect and opinion terms. Aspect and opinion term extraction is posed as a token-level classification task akin to named entity recognition. We use the dependency tree of the input query as an additional feature in a Graph Attention Network along with the token and part-of-speech features. We show that the dependency structure is a powerful feature that, in the presence of a CRF layer, substantially improves the performance and generates the best result on the commonly used datasets from SemEval 2014, 2015 and 2016. We experiment with additional layers like BiLSTM and Transformer in addition to the CRF layer. We also show that our approach works well in the presence of multiple aspects or sentiments in the same query and that it is not necessary to modify the dependency tree based on a single aspect, as was done in the original application for sentiment classification.
pdf
abs
Abstractive Hindi Text Summarization: A Challenge in a Low-Resource Setting
Daisy Monika Lal
|
Paul Rayson
|
Krishna Pratap Singh
|
Uma Shanker Tiwary
The Internet has led to a surge in text data in Indian languages; hence, text summarization tools have become essential for information retrieval. Due to a lack of data resources, prevailing summarizing systems in Indian languages have been primarily dependent on and derived from English text summarization approaches. Despite Hindi being the most widely spoken language in India, progress in Hindi summarization is being delayed due to the lack of proper labeled datasets. In this preliminary work we address two major challenges in abstractive Hindi text summarization: creating Hindi language summaries and assessing the efficacy of the produced summaries. Since transfer learning (TL) has been shown to be effective in low-resource settings, in order to assess the effectiveness of a TL-based approach for summarizing Hindi text, we perform a comparative analysis using three encoder-decoder models: attention-based (BASE), multi-level (MED), and TL-based model (RETRAIN). In relation to the second challenge, we introduce the ICE-H evaluation metric based on the ICE metric for assessing English language summaries. The Rouge and ICE-H metrics are used for evaluating the BASE, MED, and RETRAIN models. According to the Rouge results, the RETRAIN model produces slightly better abstracts than the BASE and MED models for 20k and 100k training samples. The ICE-H metric, on the other hand, produces inconclusive results, which may be attributed to the limitations of existing Hindi NLP resources, such as word embeddings and POS taggers.
pdf
abs
Verb Categorisation for Hindi Word Problem Solving
Harshita Sharma
|
Pruthwik Mishra
|
Dipti M. Sharma
Word problem solving is a challenging NLP task that deals with solving mathematical problems described in natural language. Recently, there has been renewed interest in developing word problem solvers for Indian languages. As part of this paper, we have built a Hindi arithmetic word problem solver which makes use of verbs. Additionally, we have created verb categorization data for Hindi. Verbs are very important for solving word problems with addition/subtraction operations as they help us identify the set of operations required to solve the word problems. We propose a rule-based solver that uses verb categorisation to identify operations in a word problem and generate answers for it. To perform verb categorisation, we explore several approaches and present a comparative study.
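The idea of mapping verbs to arithmetic operations can be illustrated with a toy sketch. This is not the authors' solver or data: the category names, the English verb list, and the `solve` function below are hypothetical stand-ins chosen for readability, whereas the paper works with Hindi verbs.

```python
import re

# Hypothetical verb categories (illustrative only; not the paper's data).
VERB_CATEGORY = {
    "has": "observation",  # states the starting quantity
    "buys": "positive",    # increases the quantity
    "finds": "positive",
    "gives": "negative",   # decreases the quantity
    "loses": "negative",
}


def solve(problem: str) -> int:
    """Apply one operation per sentence, chosen by the verb's category."""
    total = 0
    for sentence in re.split(r"[.?]", problem):
        numbers = re.findall(r"\d+", sentence)
        if not numbers:
            continue  # sentences without a quantity, e.g. the question
        quantity = int(numbers[0])
        category = next(
            (VERB_CATEGORY[w] for w in sentence.lower().split() if w in VERB_CATEGORY),
            None,
        )
        if category == "observation":
            total = quantity
        elif category == "positive":
            total += quantity
        elif category == "negative":
            total -= quantity
    return total


problem = "Ravi has 5 apples. He buys 3 apples. He gives 2 apples to Sita. How many apples does Ravi have?"
print(solve(problem))
```

A real system would need to track which entity owns each quantity and handle verbs missing from the lexicon, which is where learned verb categorisation becomes useful.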
pdf
abs
ReviewCraft : A Word2Vec Driven System Enhancing User-Written Reviews
Gaurav Sawant
|
Pradnya Bhagat
|
Jyoti D. Pawar
Online product reviews have become indispensable for customers in making informed buying decisions, while e-commerce platforms use them to fine-tune their recommender systems. However, since review writing is purely a voluntary process without any incentives, most customers opt out of writing reviews or write poor-quality ones. This lack of engagement poses credibility issues, as fake or biased reviews can mislead buyers who rely on them for informed decision-making. To address this issue, this paper introduces a system that suggests product features and appropriate sentiment words to help users write informative product reviews in a structured manner. The system is based on the Word2Vec model and the chi-square test. The evaluation results demonstrate that the reviews with recommendations showed a two-fold improvement both in the quality of the features covered and in the correct usage of sentiment words, as well as a 19% improvement in overall usefulness compared to reviews without recommendations. Keywords: Word2Vec, Chi-square, Sentiment words, Product Aspect/Feature.
pdf
abs
Intent Detection and Zero-shot Intent Classification for Chatbots
Sobha Lalitha Devi
|
Pattabhi RK. Rao
In this paper we describe in detail how seen and unseen intents are detected and classified. User intent detection has a critical role in dialogue systems. While analysing the intents, it has been found that intents are diversely expressed and new varieties of intents emerge continuously. Here we propose a capsule-based approach that classifies the intent and a zero-shot learning approach to identify the unseen intent. There are recently proposed methods on zero-shot classification which are implemented differently from ours. We have also developed an annotated corpus of free conversations in Tamil, the language we have used for intent classification and for our chatbot. Our proposed method for intent classification performs well.
pdf
abs
Coreference Resolution Using AdapterFusion-based Multi-Task learning
Sobha Lalitha Devi
|
Vijay Sundar Ram R.
|
Pattabhi RK. Rao
End-to-end coreference resolution is the task of identifying the mentions in a text that refer to the same real-world entity and grouping them into clusters. It is crucially required for natural language understanding tasks and other high-level NLP tasks. In this paper, we present an end-to-end architecture for neural coreference resolution using AdapterFusion, a new two-stage learning algorithm that leverages knowledge from multiple tasks. The first task is to identify the mentions in the text and the second is to determine the coreference clusters. In the first stage we learn task-specific parameters called adapters that encapsulate the task-specific information and then combine the adapters in a separate knowledge composition step to identify the mentions and their clusters. We evaluated it using the FIRE corpus for Malayalam and Tamil and achieved state-of-the-art performance.
pdf
abs
Transfer learning in low-resourced MT: An empirical study
Sainik Kumar Mahata
|
Dipanjan Saha
|
Dipankar Das
|
Sivaji Bandyopadhyay
Translation systems rely on a large and good-quality parallel corpus for producing reliable translations. However, obtaining such a corpus for low-resourced languages is a challenge. New research has shown that transfer learning can mitigate this issue by augmenting low-resourced MT systems with high-resourced ones. In this work, we explore two types of transfer learning techniques, namely, cross-lingual transfer learning and multilingual training, both with information augmentation, to examine the degree of performance improvement following the augmentation. Furthermore, we use languages of the same family (Romanic, in our case), to investigate the role of the shared linguistic property, in producing dependable translations.
pdf
abs
Transformer-based Nepali Text-to-Speech
Ishan Dongol
|
Bal Krishna Bal
Research on deep learning-based Text-to-Speech (TTS) systems has gained increasing popularity in low-resource languages as this approach is not only computationally robust but also has the capability to produce state-of-the-art results. However, these approaches are yet to be significantly explored for the Nepali language, primarily because of the lack of adequately sized datasets and secondarily because of the relatively sophisticated computing resources they demand. This paper explores the FastPitch acoustic model with HiFi-GAN vocoder for the Nepali language. We trained the acoustic model with two datasets, OpenSLR and a dataset prepared jointly by the Information and Language Processing Research Lab (ILPRL) and the Nepal Association of the Blind (NAB), to be further referred to as the ILPRLNAB dataset. We achieved a Mean Opinion Score (MOS) of 3.70 and 3.40 respectively for the same model with different datasets. The synthesized speech produced by the model was found to be quite natural and of good quality.
pdf
abs
Infusing Knowledge into Large Language Models with Contextual Prompts
Kinshuk Vasisht
|
Balaji Ganesan
|
Vikas Kumar
|
Vasudha Bhatnagar
Knowledge infusion is a promising method for enhancing Large Language Models for domain-specific NLP tasks rather than pre-training models over large data from scratch. These augmented LLMs typically depend on additional pre-training or knowledge prompts from an existing knowledge graph, which is impractical in many applications. In contrast, knowledge infusion directly from relevant documents is more generalisable and alleviates the need for structured knowledge graphs while also being useful for entities that are usually not found in any knowledge graph. With this motivation, we propose a simple yet generalisable approach for knowledge infusion by generating prompts from the context in the input text. Our experiments show the effectiveness of our approach which we evaluate by probing the fine-tuned LLMs.
pdf
abs
Can Big Models Help Diverse Languages? Investigating Large Pretrained Multilingual Models for Machine Translation of Indian Languages
Telem Joyson Singh
|
Sanasam Ranbir Singh
|
Priyankoo Sarmah
Machine translation of Indian languages is challenging due to several factors, including linguistic diversity, limited parallel data, language divergence, and complex morphology. Recently, large pre-trained multilingual models have shown promise in improving translation quality. In this paper, we conduct a large-scale study on applying large pre-trained models for English-Indic machine translation through transfer learning across languages and domains. This study systematically evaluates the practical gains these models can provide and analyzes their capabilities for the translation of Indian languages by transfer learning. Specifically, we experiment with several models, including Meta’s mBART, mBART-many-to-many, NLLB-200, M2M-100, and Google’s MT5. These models are fine-tuned on small, high-quality English-Indic parallel data across languages and domains. Our findings show that adapting large pre-trained models to particular languages by fine-tuning improves translation quality across the Indic languages, even for languages unseen during pretraining. Domain adaptation through continued fine-tuning improves results. Our study provides insights into utilizing large pretrained models to address the distinct challenges of MT of Indian languages.
pdf
abs
Revolutionizing Authentication: Harnessing Natural Language Understanding for Dynamic Password Generation and Verification
Akram Al-Rumaim
|
Jyoti D. Pawar
In our interconnected digital ecosystem, API security is paramount. Traditional static password systems, once used for API authentication, face vulnerabilities to cyber threats. This paper explores Natural Language Understanding (NLU) as a tool for dynamic password solutions, achieving 49.57% accuracy. It investigates GPT-2 for dynamic password generation and innovative NLU-based verification using a set of specific criteria and threshold adjustments. The study highlights NLU’s benefits, challenges, and prospects in enhancing API security. This approach is a significant stride in safeguarding digital interfaces amidst evolving Cyber Security threats. Keywords: Cyber Security, Authentication, API Security, Generative AI, Dynamic Passwords, Passwords Verification, NLU
pdf
abs
Leveraging Empathy, Distress, and Emotion for Accurate Personality Subtyping from Complex Human Textual Responses
Soumitra Ghosh
|
Tanisha Tiwari
|
Chetna Painkra
|
Gopendra Vikram Singh
|
Asif Ekbal
Automated personality subtyping is a crucial area of research with diverse applications in psychology, healthcare, and marketing. However, current studies face challenges such as insufficient data, noisy text data, and difficulty in capturing complex personality traits. To address these issues, including empathy, distress, and emotion as auxiliary tasks in automated personality subtyping may enhance accuracy and robustness. This study introduces a Multi-input Multi-task Framework for Personality, Empathy, Distress, and Emotion Detection (MultiPEDE). This framework harnesses the complementary information from empathy, distress, and emotion tasks (auxiliary tasks) to enhance the accuracy and generalizability of automated personality subtyping (the primary task). The model uses a novel deep-learning architecture that captures the interdependencies between these constructs, is end-to-end trainable, and does not rely on ensemble strategies, making it practical for real-world applications. Performance evaluation involves labeled examples of five personality traits, two classes each for personality, empathy, and distress detection, and seven classes for emotion detection. This approach has diverse applications, including mental health diagnosis, improving online services, and aiding job candidate selection.
pdf
abs
A Baseline System for Khasi and Assamese Bidirectional NMT with Zero available Parallel Data: Dataset Creation and System Development
Kishore Kashyap
|
Kuwali Talukdar
|
Mazida Akhtara Ahmed
|
Parvez Aziz Boruah
In this work we have tried to build a baseline Neural Machine Translation system for Khasi and Assamese in both directions. Both languages are considered low-resourced Indic languages. As far as language family is concerned, Assamese is a language from the Indo-Aryan family and Khasi belongs to the Mon-Khmer branch of the Austroasiatic language family. No prior work has investigated the performance of Neural Machine Translation for these two diverse low-resourced languages. It is also worth mentioning that no parallel corpus or test data is available for these two languages. The main contribution of this work is the creation of a Khasi-Assamese parallel corpus and test set. Apart from this, we also created baseline systems in both directions for the said language pair. We obtained a best bilingual evaluation understudy (BLEU) score of 2.78 for the Khasi to Assamese translation direction and 5.51 for the Assamese to Khasi translation direction. We then applied the phrase table injection (phrase augmentation) technique and obtained higher BLEU scores of 5.01 and 7.28 for the Khasi to Assamese and Assamese to Khasi translation directions, respectively.
pdf
abs
Parts of Speech (PoS) and Universal Parts of Speech (UPoS) Tagging: A Critical Review with Special Reference to Low Resource Languages
Kuwali Talukdar
|
Shikhar Kumar Sarma
|
Manash Pratim Bhuyan
Universal Parts of Speech (UPoS) tags are parts of speech annotations used in Universal Dependencies. Universal Dependencies (UD) helps in developing cross-linguistically consistent treebank annotations for multiple languages with a common framework and standard. For various Natural Language Processing (NLP) tasks and research such as semantic parsing, syntactic parsing as well as linguistic parsing, UD treebanks are becoming increasingly important resources. A lot of interest has been seen in adopting UD and UPoS standards and resources for integrating with various NLP techniques, including Machine Translation, Question Answering, Sentiment Analysis etc. Consequently, a wide variety of Artificial Intelligence (AI) and NLP tools are being created with UD and UPoS standards on board. Part of Speech (PoS) tagging is one of the fundamental NLP tasks, which labels a specific sentence or set of words in a paragraph with lexical and grammatical annotations, based on the context of the sentence. Contemporary Machine Learning (ML) and Deep Learning (DL) techniques require good-quality tagged resources for training potential tagger models. Low resource languages face serious challenges here. This paper discusses UPoS in UD and presents a concise yet inclusive review of the literature regarding UPoS, PoS, and various taggers for multiple languages with special reference to various low resource languages. Already adopted approaches and models developed for different low resource languages are included in this review, considering representations from a wide variety of languages. Also, the study offers a comprehensive classification based on the well-known ML and DL techniques used in the development of part-of-speech taggers. This will serve as a ready reference for understanding nuances of PoS and UPoS tagging.
pdf
abs
Neural Machine Translation for Assamese-Bodo, a Low Resourced Indian Language Pair
Kuwali Talukdar
|
Shikhar Kumar Sarma
|
Farha Naznin
|
Kishore Kashyap
|
Mazida Akhtara Ahmed
|
Parvez Aziz Boruah
Impressive results have been reported in various works related to low resource languages, using Neural Machine Translation (NMT), where the size of the parallel dataset is relatively low. This work presents a Machine Translation experiment on the low resource Indian language pair Assamese-Bodo, with a relatively low amount of parallel data. Tokenization of raw data is done with the IndicNLP tool. The NMT model is trained with the preprocessed dataset, and model performance has been observed with varying hyperparameters. Experiments have been completed with vocab sizes of 8000 and 16000. A significant increase in BLEU score has been observed on doubling the vocab size. An increase in data size has also contributed to enhanced overall performance. BLEU scores have been recorded with training on a data set of 70000 parallel sentences, and the results are compared with another round of training with a data set enhanced with 11500 Wordnet parallel data. A gold standard test data set of 500 sentences has been used for recording BLEU. The first round reported an overall BLEU of 4.0, with a vocab size of 8000. With the same vocab size and the Wordnet-enhanced dataset, a BLEU score of 4.33 was recorded. A significant increase in BLEU score (6.94) has been observed with a vocab size of 16000. The next round of experiments was done with an additional 7000 new data, and filtering of the entire dataset. The new BLEU recorded was 9.68, with a vocab size of 16000. Cross validation has also been designed and performed with 8-fold data chunks prepared on the 80K total dataset. Impressive BLEU scores of (fold 1 through fold 8) 18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, and 7.70 have been recorded. The 8th fold BLEU deviated from the trend, possibly because of non-homogeneous data in the last fold.
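The eight per-fold BLEU scores quoted above can be summarised with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract; the choice of summary statistics (mean, mean without the last fold, median) is an assumption made here for illustration and is not reported by the authors.

```python
from statistics import mean, median

# Per-fold BLEU scores as listed in the abstract (fold 1 through fold 8).
fold_bleu = [18.12, 16.28, 18.90, 19.25, 19.60, 18.43, 16.28, 7.70]

mean_all = mean(fold_bleu)              # average over all eight folds
mean_first_seven = mean(fold_bleu[:7])  # average excluding the deviating 8th fold
median_all = median(fold_bleu)          # less sensitive to the single low fold

print(f"mean over 8 folds:       {mean_all:.2f}")
print(f"mean over folds 1 to 7:  {mean_first_seven:.2f}")
print(f"median over 8 folds:     {median_all:.2f}")
```

Comparing the two means shows how much the single low fold pulls the overall average down.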
pdf
abs
Attentive Fusion: A Transformer-based Approach to Multimodal Hate Speech Detection
Atanu Mandal
|
Gargi Roy
|
Amit Barman
|
Indranil Dutta
|
Sudip Kumar Naskar
With the recent surge and exponential growth of social media usage, scrutinizing social media content for the presence of any hateful content is of utmost importance. Researchers have been working diligently over the past decade on distinguishing between content that promotes hatred and content that does not. Traditionally, the main focus has been on analyzing textual content. However, recent research has also begun to address the identification of audio-based content. Nevertheless, studies have shown that relying solely on audio or text-based content may be ineffective, as recent trends indicate that individuals often employ sarcasm in their speech and writing. To overcome these challenges, we present an approach to identify whether a speech promotes hate or not utilizing both audio and textual representations. Our methodology is based on the Transformer framework that incorporates both audio and text sampling, accompanied by our very own layer called “Attentive Fusion”. The results of our study surpassed previous state-of-the-art techniques, achieving an impressive macro F1 score of 0.927 on the Test Set.
pdf
abs
Handwritten Text Segmentation Using U-Net and Shuffled Frog-Leaping Algorithm with Scale Space Technique
Moumita Moitra
|
Sujan Kumar Saha
The paper introduces a new method for segmenting words from handwritten Bangla documents. We found that the available handwritten character recognition (HCR) systems do not provide the desired accuracy in recognizing the text written by school students. Recognizing students’ handwritten text becomes challenging due to certain factors, including a non-uniform gap between lines and words, and ambiguous, overlapping characters. The performance may be improved if the words in the text are segmented correctly before recognition. For the segmentation, we propose a combination of U-Net and a modified Scale Space method enhanced by the Shuffled Frog-Leaping Algorithm (SFLA). We employ the U-Net model for line segmentation; it effectively handles the variable spacing and skewed lines. After line segmentation, for segmenting the words, we use SFLA with Scale Space, allowing adaptive scaling and optimized parameter tuning. The proposed technique has been tested on two datasets: the openly available BN-HTR dataset and an in-house dataset prepared by collecting Bengali handwritten answer books from schools. In our experiments, we found that the proposed technique achieved promising performance on both datasets.
pdf
abs
Identifying Correlation between Sentiment Analysis and Septic News Sentences Classification Tasks
Soma Das
|
Sagarika Ghosh
|
Sanjay Chatterji
This research investigates the correlation between Sentiment and SEPSIS (SpEculation, oPinion, biaS, and twISt) characteristics in news sentences through an ablation study. Various Sentiment analysis models, including TextBlob, Vader, and RoBERTa, are examined to discern their impact on news sentences. Additionally, we explore the Logistic Regression (LR), Decision Trees (DT), Support Vector Machines (SVM) and Convolutional Neural Network (CNN) models for Septic sentence classification.
pdf
abs
KT2: Kannada-Tulu Parallel Corpus Construction for Neural Machine Translation
Asha Hegde
|
Hosahalli Lakshmaiah Shashirekha
In the last decade, Neural Machine Translation (NMT) has experienced substantial advances. However, its widespread success has revealed a limitation in terms of reduced proficiency when dealing with under-resourced language pairs, mainly due to the lack of parallel corpora in comparison to high-resourced language pairs like English-German, English-Spanish, and English-French. As a result, researchers have increasingly focused on implementing NMT techniques tailored to under-resourced language pairs and thereby, the construction/collection of parallel corpora. In view of the scarcity of parallel corpora for under-resourced languages, the strategies for building a Kannada-Tulu parallel corpus and baseline models for Machine Translation (MT) of Kannada-Tulu are described in this paper. Both Kannada and Tulu languages are under-resourced due to lack of processing tools and digital resources, especially parallel corpora, which are critical for MT development. The Kannada-Tulu parallel corpus is constructed in two ways: i) Manual Translation and ii) Automatic Text Generation (ATG). Various encoder-decoder based NMT approaches, including Recurrent Neural Network (RNN), Bidirectional RNN (BiRNN), and transformer-based architectures, trained with Gated Recurrent Units (GRU) and Long Short Term Memory (LSTM) units, are explored as baseline models for Kannada to Tulu (Kan-Tul) and Tulu to Kannada (Tul-Kan) sentence-level translations. Additionally, the study explores sub-word tokenization techniques for Kannada-Tulu language pairs, and the performances of these NMT models are evaluated using Character n-gram F-score (CHRF) and Bilingual Evaluation Understudy (BLEU) scores. Among the baselines, the transformer-based models outperformed other models with BLEU scores of 0.241 and 0.341 and CHRF scores of 0.502 and 0.598 for Kan-Tul and Tul-Kan sentence-level translations, respectively.
pdf
abs
Word Sense Disambiguation for Marathi language using Supervised Learning
Rasika Ransing
|
Archana Gulati
The task of disambiguating word senses, often referred to as Word Sense Disambiguation (WSD), is a substantial difficulty in the realm of natural language processing. Marathi is widely acknowledged as a language that has a relatively restricted range of resources. Consequently, there has been a paucity of academic research undertaken on the Marathi language. There has been little research conducted on supervised learning for Marathi Word Sense Disambiguation (WSD) mostly owing to the scarcity of sense-annotated corpora. This work aims to construct a sense-annotated corpus for the Marathi language and further use supervised learning classifiers, such as Naïve Bayes, Support Vector Machine, Random Forest, and Logistic Regression, to disambiguate polysemous words in Marathi. The performance of these classifiers is evaluated.
pdf
abs
Enhancing Telugu Part-of-Speech Tagging with Deep Sequential Models and Multilingual Embeddings
Sai Rishith Reddy Mangamuru
|
Sai Prashanth Karnati
|
Bala Karthikeya Sajja
|
Divith Phogat
|
Premjith B.
Part-of-speech (POS) tagging is a fundamental task in natural language processing (NLP) that involves assigning grammatical categories to words in a sentence. In this study, we investigate the application of deep sequential models for POS tagging of Telugu, a low-resource Dravidian language with rich morphology. We use the Universal Dependencies dataset for this research and explore various deep learning architectures, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, Gated Recurrent Units (GRUs), and their stacked variants for POS tagging. Additionally, we utilize multilingual BERT embeddings and IndicBERT embeddings to capture contextual information from the input sequences. Our experiments demonstrate that stacked LSTM with multilingual BERT embeddings achieves the highest performance, outperforming other approaches and attaining an F1 score of 0.8812. These findings suggest that deep sequential models, particularly stacked LSTMs with multilingual BERT embeddings, are effective tools for POS tagging in Telugu.
pdf
abs
Unlocking Emotions in Text: A Fusion of Word Embeddings and Lexical Knowledge for Emotion Classification
Anjali Bhardwaj
|
Nesar Ahmad Wasi
|
Muhammad Abulaish
This paper introduces an improved method for emotion classification through the integration of emotion lexicons and pre-trained word embeddings. The proposed method utilizes semantically similar features to reconcile the semantic gap between words and emotions. The proposed approach is compared against three baselines for predicting Ekman’s emotions at the document level on the GoEmotions dataset. The effectiveness of the proposed approach is assessed using standard evaluation metrics, which show at least a 5% gain in performance over baselines.
pdf
abs
Convolutional Neural Networks can achieve binary bail judgement classification
Amit Barman
|
Devangan Roy
|
Debapriya Paul
|
Indranil Dutta
|
Shouvik Kumar Guha
|
Samir Karmakar
|
Sudip Kumar Naskar
There is an evident lack of implementation of Machine Learning (ML) in the legal domain in India, and any research that does take place in this domain is usually based on data from the higher courts of law and works with English data. The lower courts and data from the different regional languages of India are often overlooked. In this paper, we deploy a Convolutional Neural Network (CNN) architecture on a corpus of Hindi legal documents. We perform a bail prediction task with the help of a CNN model and achieve an overall accuracy of 93%, which is an improvement on the benchmark accuracy set by Kapoor et al. (2022), albeit on data from 20 districts of the Indian state of Uttar Pradesh.
pdf
abs
Multiset Dual Summarization for Incongruent News Article Detection
Sujit Kumar
|
Rohan Jaiswal
|
Mohit Ram Sharma
|
Sanasam Ranbir Singh
The prevalence of deceptive and incongruent news headlines has highlighted their substantial role in the propagation of fake news, exacerbating the spread of both misinformation and disinformation. Existing studies on incongruity detection primarily concentrate on estimating the similarity between the encoded representation of headlines and the encoded representation or summary representative vector of the news body. To obtain the encoded representation of the news body, researchers often consider either sequential encoding or hierarchical encoding of the news body; to acquire a summary representative vector of the news body, they explore techniques like summarization or dual summarization methods. Nevertheless, when it comes to detecting partially incongruent news, dual summarization-based methods tend to outperform hierarchical encoding-based methods. On the other hand, for datasets focused on detecting fake news, where the hierarchical structure within a news article plays a crucial role, hierarchical encoding-based methods tend to perform better than summarization-based methods. Recognizing this contradictory performance of hierarchical encoding-based and summarization-based methods across datasets with different characteristics, we introduce a novel approach called Multiset Dual Summarization (MDS). MDS combines the strengths of both hierarchical encoding and dual summarization methods to leverage their respective advantages. We conducted experiments on datasets with diverse characteristics, and our findings demonstrate that our proposed model outperforms established state-of-the-art baseline models.
pdf
abs
A comparative study of transformer and transfer learning MT models for English-Manipuri
Kshetrimayum Boynao Singh
|
Ningthoujam Avichandra Singh
|
Loitongbam Sanayai Meetei
|
Ningthoujam Justwant Singh
|
Thoudam Doren Singh
|
Sivaji Bandyopadhyay
In this work, we focus on the development of machine translation (MT) models for a low-resource language pair, viz. English-Manipuri. Manipuri is one of the eight scheduled languages of the Indian constitution. Manipuri is currently written in two different scripts: one is its original script called Meitei Mayek and the other is the Bengali script. We evaluate the performance of English-Manipuri MT models based on transformer and transfer learning techniques. Our MT models are trained using a dataset of 69,065 parallel sentences and validated on 500 sentences. Using 500 test sentences, the English to Manipuri MT models achieved BLEU scores of 19.13 and 29.05 with mT5 and OpenNMT, respectively. The results demonstrate that the OpenNMT model significantly outperforms the mT5 model. Additionally, the Manipuri to English MT system trained with the OpenNMT model reported a BLEU score of 30.90. We also carried out a comparative analysis between the Bengali script and the transliterated Meitei Mayek script for English-Manipuri MT models. This analysis reveals that the transliterated version enhances the MT model performance, resulting in a notable +2.35 improvement in the BLEU score.
pdf
abs
The Current Landscape of Multimodal Summarization
Atharva Kumbhar
|
Harsh Kulkarni
|
Atmaja Mali
|
Sheetal Sonawane
|
Prathamesh Mulay
In recent years, the rise of multimedia content on the internet has inundated users with a vast and diverse array of information, including images, videos, and textual data. Handling this flood of multimedia data necessitates advanced techniques capable of distilling this wealth of information into concise, meaningful summaries. Multimodal summarization, which involves generating summaries from multiple modalities such as text, images, and videos, has become a pivotal area of research in natural language processing, computer vision, and multimedia analysis. This survey paper offers an overview of the state-of-the-art techniques, methodologies, and challenges in the domain of multimodal summarization. We highlight the interdisciplinary advancements made in this field, specifically along two main frontiers: 1) Multimodal Abstractive Summarization, and 2) Pre-training Language Models in Multimodal Summarization. By synthesizing insights from existing research, we aim to provide a holistic understanding of multimodal summarization techniques.
pdf
abs
Automated Answer Validation using Text Similarity
Balaji Ganesan
|
Arjun Ravikumar
|
Lakshay Piplani
|
Rini Bhaumik
|
Dhivya Padmanaban
|
Shwetha Narasimhamurthy
|
Chetan Adhikary
|
Subhash Deshapogu
Automated answer validation can help improve learning outcomes by providing appropriate feedback to learners, and by making question answering systems and online learning solutions more widely available. There have been some works in science question answering which show that information retrieval methods outperform neural methods, especially in the multiple choice version of this problem. We implement Siamese neural network models and produce a generalised solution to this problem. We compare our supervised model with other text similarity based solutions.
pdf
abs
QeMMA: Quantum-Enhanced Multi-Modal Sentiment Analysis
Arpan Phukan
|
Asif Ekbal
Multi-modal data analysis presents formidable challenges, as developing effective methods to capture correlations among different modalities remains an ongoing pursuit. In this study, we address multi-modal sentiment analysis through a novel quantum perspective. We propose that quantum principles, such as superposition, entanglement, and interference, offer a more comprehensive framework for capturing not only the cross-modal interactions between text, acoustics, and visuals but also the intricate relations within each modality. To empirically evaluate our approach, we employ the CMU-MOSEI dataset as our testbed and utilize Qiskit by IBM to run our experiments on a quantum computer. Our proposed Quantum-Enhanced Multi-Modal Analysis Framework (QeMMA) showcases its significant potential by surpassing the baseline by 3.52% and 10.14% in terms of accuracy and F1 score, respectively, highlighting the promise of quantum-inspired methodologies.
pdf
abs
Automatic Data Retrieval for Cross Lingual Summarization
Nikhilesh Bhatnagar
|
Ashok Urlana
|
Pruthwik Mishra
|
Vandan Mujadia
|
Dipti M. Sharma
Cross-lingual summarization involves summarizing text written in one language into a different one. There is a body of research addressing cross-lingual summarization from English to other European languages. In this work, we aim to perform cross-lingual summarization from English to Hindi. We propose that pairing up the coverage of newsworthy events in textual and video format can prove helpful for data acquisition for cross-lingual summarization. We analyze the data and propose methods to match articles to video descriptions that serve as document and summary pairs. We also outline filtering methods over reasonable thresholds to ensure the correctness of the summaries. Further, we make available 28,583 mono and cross-lingual article-summary pairs. We also build and analyze multiple baselines on the collected data and report error analysis.
pdf
Cross-Lingual Fact Checking: Automated Extraction and Verification of Information from Wikipedia using References
Shivansh Subramanian
|
Ankita Maity
|
Aakash Jain
|
Bhavyajeet Singh
|
Harshit Gupta
|
Lakshya Khanna
|
Vasudeva Varma
pdf
Combining Pre trained Speech and Text Encoders for Continuous Spoken Language Processing
Karan Singla
|
Mahnoosh Mehrabani
|
Daniel Pressel
|
Ryan Price
|
Bhargav Srinivas Chinnari
|
Yeon-Jun Kim
|
Srinivas Bangalore