Soman Kp

Also published as: Soman KP


2021

Amrita_CEN_NLP@DravidianLangTech-EACL2021: Deep Learning-based Offensive Language Identification in Malayalam, Tamil and Kannada
Sreelakshmi K | Premjith B | Soman Kp
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper describes the submission of the team Amrita_CEN_NLP to the shared task on Offensive Language Identification in Dravidian Languages at EACL 2021. We implemented three deep neural network architectures: a hybrid network with a convolutional layer, a Bidirectional Long Short-Term Memory (Bi-LSTM) layer and a hidden layer; a network containing a Bi-LSTM; and a network with a Bidirectional Recurrent Neural Network (Bi-RNN). In addition, we incorporated a cost-sensitive learning approach to deal with the problem of class imbalance in the training data. Among the three models, the hybrid network exhibited the best training performance, and we submitted its predictions.
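The cost-sensitive learning mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which weighting scheme the authors used; the inverse-frequency ("balanced") weights below, the function names, and the toy labels are all assumptions chosen for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = N / (K * count).

    N is the number of samples, K the number of classes. This is one
    common heuristic, assumed here; the paper may use a different scheme.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Weighted misclassification rate: each error costs the weight of its true class."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical, heavily imbalanced toy data (not from the shared task).
labels = ["not_offensive"] * 8 + ["offensive"] * 2
weights = balanced_class_weights(labels)

# A classifier that always predicts the majority class looks good on plain
# accuracy (80%) but is penalised once errors are weighted by class.
majority_pred = ["not_offensive"] * len(labels)
error = weighted_error(labels, majority_pred, weights)
```

With such weights, mistakes on the rare class contribute more to the training loss, which discourages a model from ignoring that class.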

Mining Bilingual Word Pairs from Comparable Corpus using Apache Spark Framework
Sanjanasri Jp | Vijay Krishna Menon | Soman Kp | Krzysztof Wolk
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Bilingual dictionaries are essential resources in many areas of natural language processing, but resource-scarce and less popular language pairs rarely have them. Efficient automatic methods for inducing bilingual dictionaries are needed, as manual resources and efforts are scarce for low-resourced languages. In this paper, we induce word translations using bilingual embeddings. We use the Apache Spark framework for parallel computation. Further, to validate the quality of the generated bilingual dictionary, we use it in a phrase-table aided Neural Machine Translation (NMT) system. The system performs moderately well with a manual bilingual dictionary; we replace this with our induced dictionary. The corresponding translated outputs are compared using the Bilingual Evaluation Understudy (BLEU) and Rank-based Intuitive Bilingual Evaluation Score (RIBES) metrics.

Amrita_CEN_NLP@SDP2021 Task A and B
Premjith B | Isha Indhu S | Kavya S. Kumar | Lakshaya Karthikeyan | Soman Kp
Proceedings of the Second Workshop on Scholarly Document Processing

The purpose and influence of a citation are important in understanding the quality of a publication. The 3C citation context classification shared task at the Second Workshop on Scholarly Document Processing aims to address this problem. This paper describes the submission of the team Amrita_CEN_NLP to the shared task. We employed Bidirectional Long Short-Term Memory (Bi-LSTM) networks and a Random Forest classifier to model these problems while accounting for the class imbalance in the data.

2020

Amrita_CEN_NLP @ WOSP 3C Citation Context Classification Task
Premjith B | Soman KP
Proceedings of the 8th International Workshop on Mining Scientific Publications

Identifying the purpose and influence of a citation is significant in assessing the impact of a publication. The ‘3C’ Citation Context Classification Task at the Workshop on Mining Scientific Publications is a shared task that addresses these problems. This working note describes the submissions of the Amrita_CEN_NLP team to the shared task. We used Random Forest with cost-sensitive learning to classify sentences encoded as vectors of dimension 300.

cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus
Sanjanasri JP | Premjith B | Vijay Krishna Menon | Soman KP
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

Natural Language Processing (NLP) is the field of artificial intelligence that gives the computer the ability to interpret, perceive and extract appropriate information from human languages. Contemporary NLP is predominantly a data-driven process. It employs machine learning and statistical algorithms to learn language structures from textual corpora. While NLP has been applied extensively to English, to certain European languages such as Spanish and German, and to Chinese and Arabic, this is not the case for many Indian languages. There are obvious advantages in creating aligned bilingual and multilingual corpora. Machine translation, cross-lingual information retrieval, content availability and linguistic comparison are a few of the most sought-after applications of such parallel corpora. This paper explains and validates a parallel corpus we created for the English-Tamil language pair.

BUCC2020: Bilingual Dictionary Induction using Cross-lingual Embedding
Sanjanasri JP | Vijay Krishna Menon | Soman KP
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

This paper presents a deep learning system for the BUCC 2020 shared task: Bilingual dictionary induction from comparable corpora. We submitted two runs for this shared task: the German (de) and English (en) language pair for the “closed track” and the Tamil (ta) and English (en) language pair for the “open track”. Our core approach focuses on quantifying the semantics of the language pairs, so that the semantics of two different languages can be compared or used for transfer learning. With the advent of word embeddings, it is possible to quantify this. In this paper, we propose a deep learning approach which makes use of the supplied training data to generate cross-lingual embeddings. These are later used for inducing a bilingual dictionary from comparable corpora.
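As a rough illustration of dictionary induction from cross-lingual embeddings, the sketch below picks, for each source word, the target word with the highest cosine similarity in a shared space. This nearest-neighbour retrieval is a generic technique assumed here, not necessarily the authors' method; the function names and the three-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def induce_dictionary(source_vecs, target_vecs):
    """Map each source word to its nearest target word by cosine similarity.

    Both inputs are {word: vector} dicts whose vectors are assumed to
    already live in one shared cross-lingual space.
    """
    pairs = {}
    for src_word, src_vec in source_vecs.items():
        best = max(target_vecs,
                   key=lambda tgt: cosine(src_vec, target_vecs[tgt]))
        pairs[src_word] = best
    return pairs


# Hypothetical toy embeddings for a German-English pair (illustrative only).
source_vecs = {"hund": [0.9, 0.1, 0.0], "haus": [0.1, 0.9, 0.1]}
target_vecs = {
    "dog": [1.0, 0.0, 0.1],
    "house": [0.0, 1.0, 0.0],
    "tree": [0.1, 0.1, 1.0],
}
dictionary = induce_dictionary(source_vecs, target_vecs)
```

In practice the embeddings would be learned from data and the search would run over large vocabularies, so an approximate nearest-neighbour index or a distributed framework would typically replace the brute-force loop.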

2019

A Machine Learning Approach for Identifying Compound Words from a Sanskrit Text
Premjith B | Chandni Chandran V | Shriganesh Bhat | Soman Kp | Prabaharan P
Proceedings of the 6th International Sanskrit Computational Linguistics Symposium

2017

deepCybErNet at EmoInt-2017: Deep Emotion Intensities in Tweets
Vinayakumar R | Premjith B | Sachin Kumar S | Soman KP | Prabaharan Poornachandran
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This working note presents the methodology used in the deepCybErNet submission to the shared task on Emotion Intensities in Tweets (EmoInt) at WASSA-2017. The goal of the task is to predict a real-valued score in the range [0, 1] for a particular tweet with an emotion type. To do this, we used Bag-of-Words and embeddings based on a recurrent network architecture. We developed two systems, and experiments were conducted on the Emotion Intensity shared Task 1 database at WASSA-2017. The system that uses word embeddings based on a recurrent network architecture achieved the highest 5-fold cross-validation accuracy. It used embeddings with a recurrent network to extract optimal features at the tweet level and logistic regression for prediction. These methods are highly language independent, and experimental results show that the proposed methods are apt for predicting a real-valued score in the range [0, 1] for a given tweet with its emotion type.

2016

Amrita_CEN at SemEval-2016 Task 1: Semantic Relation from Word Embeddings in Higher Dimension
Barathi Ganesh HB | Anand Kumar M | Soman KP
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)