Sindhuja Gopalan

2023

Scaling Neural ITN for Numbers and Temporal Expressions in Tamil: Findings for an Agglutinative Low-resource Language
Bhavuk Singhal | Sindhuja Gopalan | Amrith Krishna | Malolan Chetlur
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

Inverse text normalization (ITN) involves rewriting the verbalised form of text from spoken transcripts into its corresponding written form. Identifying ITN entries is inherently challenging because of spelling variations arising from dialects, transcription errors, etc. Additionally, in Tamil, word boundaries between adjacent words in a sentence often get obscured due to Punarchi, i.e. phonetic transformation of these boundaries. Being morphologically rich, Tamil words show a high degree of agglutination due to inflection and clitics. The combination of these factors leads to a high degree of surface-form variation, making pure rule-based approaches difficult to scale. Instead, we experiment with fine-tuning three pre-trained neural LMs: a seq2seq model (s2s), a non-autoregressive text editor (NAR), and a sequence tagger + rules combination (tagger). While the tagger approach works best in a fully supervised setting, s2s performs best (98.05 F-Score) when augmented with additional data via bootstrapping and data augmentation (DA&B). s2s reports a cumulative percentage improvement of 20.1%, and all our models show statistically significant gains with DA&B. Compared to a fully supervised setup, bootstrapping alone yields a percentage improvement as high as 14.12%, even with a small seed set of 324 ITN entries.

2017

Scalable Bio-Molecular Event Extraction System towards Knowledge Acquisition
Pattabhi RK Rao | Sindhuja Gopalan | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Cross Linguistic Variations in Discourse Relations among Indian Languages
Sindhuja Gopalan | Lakshmi S | Sobha Lalitha Devi
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

BioDCA Identifier: A System for Automatic Identification of Discourse Connective and Arguments from Biomedical Text
Sindhuja Gopalan | Sobha Lalitha Devi
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

This paper describes a natural language processing system developed for the automatic identification of explicit connectives, their senses, and their arguments. Prior work has shown that differences in the usage of connectives across corpora negatively affect cross-domain connective identification. Hence, the development of domain-specific discourse parsers has become indispensable. Here, we present a corpus of Medline abstracts annotated with discourse relations. A Kappa score is calculated to check the annotation quality of our corpus. Previous work on discourse analysis in biomedical data has concentrated only on the identification of connectives; hence, we have developed an end-to-end parser for connective and argument identification using the Conditional Random Fields algorithm. The type and sub-type of the connective sense are also identified. The results obtained are encouraging.

2015

Co-authors

Bhavuk Singhal 1

Amrith Krishna 1

Malolan Chetlur 1