Sobha Lalitha Devi

Also published as: Sobha Lalitha Devi, Lalitha Devi Sobha

2025

pdf bib abs
SeCoRel: Multilingual Discourse Analysis in DISRPT 2025
Sobha Lalitha Devi | Pattabhi Rk Rao | Vijay Sundar Ram
Proceedings of the 4th Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2025)

The work presented here describes our participation in DISRPT 2025 shared task in three tasks, Task1: Discourse Unit Segmentation across Formalisms, Task 2: Discourse Connective Identification across Languages and Task 3: Discourse Relation Classification across Formalisms. We have fine-tuned XLM-RoBERTa, a language model to address these three tasks. We have come up with one single multilingual language model for each task. Our system handles data in both the formats .conllu and .tok and different discourse formalisms. We have obtained encouraging results. The performance on test data in the three tasks is similar to the results obtained for the development data.

pdf bib abs
CLRG@FinCausal2025: Cause-Effect Extraction in Finance Domain
Vibhavkrishnan K S | Pattabhi RK Rao | Sobha Lalitha Devi
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

This paper presents our work on Cause-Effect information extraction specifically in the financial domain. Cause and effect information is very much needed for expert decision making. Particularly, in the financial domain, the fund managers, financial analysts, etc. need to have the information on cause-effects for their works. Natural Language Processing (NLP) techniques help in the automatic extraction of cause and effect from a given text. In this work, we build various cause-effect text span detection models using pre-trained transformer-based language models and fine tune these models using the data provided by FinCausal 2025 task organizers. We have only used FinCausal 2025 data sets to train our models. No other external data is used. Our ensemble of sequence tagging models based on the Fine-tuned RoBERTa-Large language model achieves SAS score of 0.9604 and Exact match score of 0.7214 for English. Similarly for Spanish we obtain SAS score of 0.9607 and Exact match score of 0.7166. This is our first time participation in the FinCausal 2025 Task.

2024

pdf bib
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Sobha Lalitha Devi | Karunesh Arora
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

pdf bib abs
End to End Multilingual Coreference Resolution for Indian Languages
Sobha Lalitha Devi | Vijay Sundar Ram | Pattabhi RK Rao
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

This paper describes an approach on an end to end model for Multilingual Coreference Resolution (CR) for low resource languages such as Tamil, Malayalam and Hindi. We have done fine tune the XLM-Roberta large model on multilingual training dataset using specific languages with linguistic features and without linguistic features. XLM-R with linguistic features achieves better results than the baseline system. This shows that giving the linguistic knowledge enriches the system performance. The performance of the system is comparable with the state of the art systems.

pdf bib abs
LangBot-Language Learning Chatbot
Madhubala Sundaram | Pattabhi RK Rao | Sobha Lalitha Devi
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Chatbots are being widely used in educational domain to revolutionize how students interact and learn along with traditional methods of learning. This paper presents our work on LangBot, a chatbot developed for learning Tamil language. LangBot developed integrates the interactive features of chatbots with the study material of the Tamil courses offered by Tamil Virtual Academy, Government of Tamil Nadu. LangBot helps students in enhancing their learning skills and increases their interest in learning the language. Using semi-automatic methods, we generate question and answers related to all topics in the courses. We then develop a generative language model and also Retrieval Augmented Generation (RAG) so that the system can incorporate new syllabus changes. We have performed manual user studies. The results obtained are encouraging. This approach offers learners an interactive tool that aligns with their syllabus. It is observed that this enriches the overall learning experience.

pdf bib abs
Finding the Causality of an Event in News Articles
Sobha Lalitha Devi | Pattabhi RK Rao
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

This paper discusses about the finding of causality of an event in newspaper articles. The analysis of causality , otherwise known as cause and effect is crucial for building efficient Natural Language Understanding (NLU) supported AI systems such as Event tracking and it is considered as a complex semantic relation under discourse theory. A cause-effect relation consists of a linguistic marker and its two arguments. The arguments are semantic arguments where the cause is the first argument (Arg1) and the effect is the second argument(Arg2). In this work we have considered the causal relations in Tamil Newspaper articles. The analysis of causal constructions, the causal markers and their syntactic relation lead to the identification of different features for developing the language model using RBMs (Restricted Boltzmann Machine). The experiments we performed have given encouraging results. The Cause-Effect system developed is used in a mobile App for Event profiling called “Nigalazhvi” where the cause and effect of an event is identified and given to the user.

2023

pdf bib
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
Jyoti D. Pawar | Sobha Lalitha Devi
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

pdf bib abs
Intent Detection and Zero-shot Intent Classification for Chatbots
Sobha Lalitha Devi | Pattabhi RK. Rao
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

In this paper we give in detail how seen and unseen intent is detected and classified. User intent detection has a critical role in dialogue systems. While analysing the intents it has been found that intents are diversely expressed and new variety of intents emerge continuously. Here we propose a capsule-based approach that classifies the intent and a zero-shot learning to identify the unseen intent. There are recently proposed methods on zero-shot classification which are implemented differently from ours. We have also developed an annotated corpus of free conversations in Tamil, the language we have used for intent classification and for our chatbot. Our proposed method on intent classification performs well.

pdf bib abs
Coreference Resolution Using AdapterFusion-based Multi-Task learning
Sobha Lalitha Devi | Vijay Sundar Ram R. | Pattabhi RK. Rao
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

End-to-end coreference resolution is the task of identifying the mentions in a text that refer to the same real world entity and grouping them into clusters. It is crucially required for natural language understanding tasks and other high-level NLP tasks. In this paper, we present an end-to-end architecture for neural coreference resolution using AdapterFusion, a new two stage learning algorithm that leverages knowledge from multiple tasks. First task is in identifying the mentions in the text and the second to determine the coreference clusters. In the first task we learn task specific parameters called adapters that encapsulate the taskspecific information and then combine the adapters in a separate knowledge composition step to identify the mentions and their clusters. We evaluated it using FIRE corpus for Malayalam and Tamil and we achieved state of art performance.

pdf bib abs
‘ChemXtract’ A System for Extraction of Chemical Events from Patent Documents
Pattabhi RK Rao | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

ChemXtraxt main goal is to extract the chemical events from patent documents. Event extraction requires that we first identify the names of chemical compounds involved in the events. Thus, in this work two extractions are done and they are (a) names of chemical compounds and (b) event that identify the specific involvement of the chemical compounds in a chemical reaction. Extraction of essential elements of a chemical reaction, generally known as Named Entity Recognition (NER), extracts the compounds, condition and yields, their specific role in reaction and assigns a label according to the role it plays within a chemical reaction. Whereas event extraction identifies the chemical event relations between the chemical compounds identified. Here in this work we have used Neural Conditional Random Fields (NCRF), which combines the power of artificial neural network (ANN) and CRFs. Different levels of features that include linguistic, orthographical and lexical clues are used. The results obtained are encouraging.

pdf bib abs
Hindi to Dravidian Language Neural Machine Translation Systems
Vijay Sundar Ram | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Neural machine translation (NMT) has achieved state-of-art performance in high-resource language pairs, but the performance of NMT drops in low-resource conditions. Morphologically rich languages are yet another challenge in NMT. The common strategy to handle this issue is to apply sub-word segmentation. In this work, we compare the morphologically inspired segmentation methods against the Byte Pair Encoding (BPE) in processing the input for building NMT systems for Hindi to Malayalam and Hindi to Tamil, where Hindi is an Indo-Aryan language and Malayalam and Tamil are south Dravidian languages. These two languages are low resource, morphologically rich and agglutinative. Malayalam is more agglutinative than Tamil. We show that for both the language pairs, the morphological segmentation algorithm out-performs BPE. We also present an elaborate analysis on translation outputs from both the NMT systems.

2022

pdf bib abs
Classification of Multiword Expressions in Malayalam
Treesa Cyriac | Sobha Lalitha Devi
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

Multiword expression is an interesting concept in languages and the MWEs of a language are not easy for a non-native speaker to understand. It includes lexicalized phrases, idioms, collocations etc. Data on multiwords are helpful in language processing. ‘Multiword expressions in Malayalam’ is a less studied area. In this paper, we are trying to explore multiwords in Malayalam and to classify them as per the three idiosyncrasies: semantic idiosyncrasy, syntactic idiosyncrasy, and statistic idiosyncrasy. Though these are already identified, they are not being studied in Malayalam. The classification and features are given and are studied using Malayalam multiwords. Through this study, we identified how the linguistic features of Malayalam such as agglutination influence its multiword expressions in terms of pronunciation and spelling. Malayalam has a set of code-mixed multiword expressions which is also addressed in this study.

pdf bib abs
Automatic Identification of Explicit Connectives in Malayalam
Kumari Sheeja S | Sobha Lalitha Devi
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

This work presents an automatic identification of explicit connectives and its arguments using supervised method, Conditional Random Fields (CRFs). In this work, we focus on the identification of connectives and their arguments in the corpus. We consider explicit connectives and its arguments for the present study. The corpus we have considered has 4,000 sentences from Malayalam documents and manually annotated the corpus for POS, chunk, clause, discourse connectives and its arguments. The corpus thus annotated is used for building the base engine. The percentage of the performance of the system is evaluated based on the precision, recall and F-score and obtained encouraging results. We have analysed the errors generated by the system and used the features obtained from the anlaysis to improve the performance of the system

2021

pdf bib
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
Sivaji Bandyopadhyay | Sobha Lalitha Devi | Pushpak Bhattacharyya
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

pdf bib abs
Dependency Parsing in a Morphological rich language, Tamil
Vijay Sundar Ram | Sobha Lalitha Devi
Proceedings of the First Workshop on Parsing and its Applications for Indian Languages

Dependency parsing is the process of analysing the grammatical structure of a sentence based on the dependencies between the words in a sentence. The annotation of dependency parsing is done using different formalisms at word-level namely Universal Dependencies and chunk-level namely AnnaCorra. Though dependency parsing is deeply dealt in languages such as English, Czech etc the same cannot be adopted for the morphologically rich and agglutinative languages. In this paper, we discuss the development of a dependency parser for Tamil, a South Dravidian language. The different characteristics of the language make this task a challenging task. Tamil, a morphologically rich and agglutinative language, has copula drop, accusative and genitive case drop and pro-drop. Coordinative constructions are introduced by affixation of morpheme ‘um’. Embedded clausal structures are common in relative participle and complementizer clauses. In this paper, we have discussed our approach to handle some of these challenges. We have used Malt parser, a supervised learning- approach based implementation. We have obtained an accuracy of 79.27% for Unlabelled Attachment Score, 73.64% for Labelled Attachment Score and 68.82% for Labelled Accuracy.

2020

pdf bib abs
Handling Noun-Noun Coreference in Tamil
Vijay Sundar Ram | Sobha Lalitha Devi
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Natural language understanding by automatic tools is the vital requirement for document processing tools. To achieve it, automatic system has to understand the coherence in the text. Co-reference chains bring coherence to the text. The commonly occurring reference markers which bring cohesiveness are Pronominal, Reflexives, Reciprocals, Distributives, One-anaphors, Noun–noun reference. Here in this paper, we deal with noun-noun reference in Tamil. We present the methodology to resolve these noun-noun anaphors and also present the challenges in handling the noun-noun anaphoric relations in Tamil.

pdf bib abs
A Deeper Study on Features for Named Entity Recognition
Malarkodi C S | Sobha Lalitha Devi
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

This paper deals with the various features used for the identification of named entities. The performance of the machine learning system heavily depends on the feature selection criteria. The intention to trace the essential features required for the development of named entity system across languages motivated us to conduct this study. The linguistic analysis was done to find out the part of speech patterns surrounding the context of named entities and from the observation linguistic oriented features are identified for both Indian and European languages. The Indian languages belongs to Dravidian language family such as Tamil, Telugu, Malayalam, Indo-Aryan language family such as Hindi, Punjabi, Bengali and Marathi, European languages such as English, Spanish, Dutch, German and Hungarian are used in this work. The machine learning technique CRFs was used for the system development. The experiments were conducted using the linguistic features and the results obtained for each languages are comparable with state-of-art systems.

2019

pdf bib abs
Resolving Pronouns for a Resource-Poor Language, Malayalam Using Resource-Rich Language, Tamil.
Sobha Lalitha Devi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper we give in detail how a resource rich language can be used for resolving pronouns for a less resource language. The source language, which is resource rich language in this study, is Tamil and the resource poor language is Malayalam, both belonging to the same language family, Dravidian. The Pronominal resolution developed for Tamil uses CRFs. Our approach is to leverage the Tamil language model to test Malayalam data and the processing required for Malayalam data is detailed. The similarity at the syntactic level between the languages is exploited in identifying the features for developing the Tamil language model. The word form or the lexical item is not considered as a feature for training the CRFs. Evaluation on Malayalam Wikipedia data shows that our approach is correct and the results, though not as good as Tamil, but comparable.

2017

pdf bib
Scalable Bio-Molecular Event Extraction System towards Knowledge Acquisition
Pattabhi RK Rao | Sindhuja Gopalan | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf bib
Co-reference Resolution in Tamil Text
Vijay Sundar Ram | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf bib
Cross Linguistic Variations in Discourse Relations among Indian Languages
Sindhuja Gopalan | Lakshmi S | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

pdf bib
How to Handle Split Antecedents in Tamil?
Vijay Sundar Ram | Sobha Lalitha Devi
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

pdf bib abs
BioDCA Identifier: A System for Automatic Identification of Discourse Connective and Arguments from Biomedical Text
Sindhuja Gopalan | Sobha Lalitha Devi
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

This paper describes a Natural language processing system developed for automatic identification of explicit connectives, its sense and arguments. Prior work has shown that the difference in usage of connectives across corpora affects the cross domain connective identification task negatively. Hence the development of domain specific discourse parser has become indispensable. Here, we present a corpus annotated with discourse relations on Medline abstracts. Kappa score is calculated to check the annotation quality of our corpus. The previous works on discourse analysis in bio-medical data have concentrated only on the identification of connectives and hence we have developed an end-end parser for connective and argument identification using Conditional Random Fields algorithm. The type and sub-type of the connective sense is also identified. The results obtained are encouraging.

Co-authors

Venues