2022
Named Entity Recognition Based Automatic Generation of Research Highlights
Tohida Rehman | Debarshi Kumar Sanyal | Prasenjit Majumder | Samiran Chattopadhyay
Proceedings of the Third Workshop on Scholarly Document Processing
A scientific paper is traditionally prefaced by an abstract that summarizes the paper. Recently, research highlights that focus on the main findings of the paper have emerged as a complementary summary in addition to an abstract. However, highlights are not yet as common as abstracts, and are absent in many papers. In this paper, we aim to automatically generate research highlights using different sections of a research paper as input. We investigate whether the use of named entity recognition on the input improves the quality of the generated highlights. In particular, we have used two deep learning-based models: the first is a pointer-generator network, and the second augments the first model with a coverage mechanism. We then augment each of the above models with named entity recognition features. The proposed method can be used to produce highlights for papers with missing highlights. Our experiments show that adding named entity information improves the performance of the deep learning-based summarizers in terms of ROUGE, METEOR and BERTScore measures.
2021
IRLAB-DAIICT@DravidianLangTech-EACL2021: Neural Machine Translation
Raj Prajapati | Vedant Vijay Parikh | Prasenjit Majumder
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
This paper describes our team’s submission to the EACL DravidianLangTech-2021 shared task on Machine Translation of Dravidian languages. We submitted translations for English to Malayalam, Tamil and Telugu, and also for the Tamil-Telugu language pair. The submissions focus mainly on having an adequate amount of data backed by good preprocessing, including some custom-made rules to remove unnecessary sentences, to produce quality translations. We conducted several experiments on these models by tweaking the architecture, Byte Pair Encoding (BPE) and other hyperparameters.
IRNLP_DAIICT@DravidianLangTech-EACL2021: Offensive Language identification in Dravidian Languages using TF-IDF Char N-grams and MuRIL
Bhargav Dave | Shripad Bhat | Prasenjit Majumder
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
This paper presents the participation of the IRNLP_DAIICT team from the Information Retrieval and Natural Language Processing lab at DA-IICT, India in the DravidianLangTech-EACL2021 shared task on Offensive Language Identification in Dravidian Languages. The aim of this shared task is to identify offensive language in a code-mixed dataset of YouTube comments. The task is to classify comments into Not Offensive (NO), Offensive Untargeted (OU), Offensive Targeted Individual (OTI), Offensive Targeted Group (OTG), Offensive Targeted Others (OTO) and Other Language (OL) for three Dravidian languages: Kannada, Malayalam and Tamil. We use TF-IDF character n-grams and pretrained MuRIL embeddings for text representation, and Logistic Regression and Linear SVM for classification. Our best approach ranked ninth, third and eighth, with weighted F1 scores of 0.64, 0.95 and 0.71 on the Kannada, Malayalam and Tamil test datasets, respectively.
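The TF-IDF character n-gram representation named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `char_ngrams` and `tfidf_char_ngrams`, the n-gram range, the smoothed idf formula and the sample strings are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max found in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_char_ngrams(docs, n_min=1, n_max=3):
    """Map each document to a dict {ngram: tf-idf weight}.

    Assumption: raw term frequency times the smoothed idf
    log((1 + N) / (1 + df)) + 1. Other weightings are equally valid.
    """
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + f)) + 1 for g, f in df.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]


# Made-up strings imitating spelling variation in code-mixed comments.
comments = ["super movie", "supr muvi"]
vectors = tfidf_char_ngrams(comments, 2, 3)
```

N-grams shared by both strings (such as "su") receive a lower weight than n-grams that occur in only one of them. Sparse vectors of this kind could then be passed to a linear classifier such as the Logistic Regression or Linear SVM models mentioned in the abstract.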
IRNLP_DAIICT@LT-EDI-EACL2021: Hope Speech detection in Code Mixed text using TF-IDF Char N-grams and MuRIL
Bhargav Dave | Shripad Bhat | Prasenjit Majumder
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion
This paper presents the participation of the IRNLP_DAIICT team from the Information Retrieval and Natural Language Processing lab at DA-IICT, India in the LT-EDI@EACL2021 Hope Speech Detection task. The aim of this shared task is to identify hope speech in a code-mixed dataset of YouTube comments. The task is to classify comments into Hope Speech, Non Hope Speech or Not in Language, for three languages: English, Malayalam-English and Tamil-English. We use TF-IDF character n-grams and pretrained MuRIL embeddings for text representation, and Logistic Regression and Linear SVM for classification. Our best approach ranked second, eighth and fifth, with weighted F1 scores of 0.92, 0.75 and 0.57 on the English, Malayalam-English and Tamil-English test datasets, respectively.
2020
IRLab_DAIICT at SemEval-2020 Task 9: Machine Learning and Deep Learning Methods for Sentiment Analysis of Code-Mixed Tweets
Apurva Parikh | Abhimanyu Singh Bisht | Prasenjit Majumder
Proceedings of the Fourteenth Workshop on Semantic Evaluation
The paper describes systems that our team IRLab_DAIICT employed for the shared task Sentiment Analysis for Code-Mixed Social Media Text in SemEval 2020. We conducted our experiments on a Hindi-English CodeMixed Tweet dataset which was annotated with sentiment labels. F1-score was the official evaluation metric and our best approach, an ensemble of Logistic Regression, Random Forest and BERT, achieved an F1-score of 0.693.
IRLab_DAIICT at SemEval-2020 Task 12: Machine Learning and Deep Learning Methods for Offensive Language Identification
Apurva Parikh | Abhimanyu Singh Bisht | Prasenjit Majumder
Proceedings of the Fourteenth Workshop on Semantic Evaluation
The paper describes the systems that our team IRLab_DAIICT employed for the shared task OffensEval 2020: Multilingual Offensive Language Identification in Social Media. We conducted experiments on the English language dataset, which contained weakly labelled data. There were three sub-tasks, but we participated only in sub-tasks A and B. We employed machine learning techniques such as Logistic Regression, Support Vector Machine and Random Forest, and deep learning techniques such as Convolutional Neural Network and BERT. Our best approach achieved a macro F1 score of 0.91 for sub-task A and 0.64 for sub-task B.
2019
DA-LD-Hildesheim at SemEval-2019 Task 6: Tracking Offensive Content with Deep Learning using Shallow Representation
Sandip Modha | Prasenjit Majumder | Daksh Patel
Proceedings of the 13th International Workshop on Semantic Evaluation
This paper presents the participation of team DA-LD-Hildesheim of the Information Retrieval and Language Processing lab at DA-IICT, India in the SemEval-2019 OffensEval track. The aim of this shared task is to identify offensive content at a fine-grained level. The task is divided into three sub-tasks: the system is required to check whether social media posts contain any offensive or profane content, to determine whether that content is targeted or untargeted towards any entity, and to classify targeted posts into the individual, group or other categories. Social media posts suffer from the data sparsity problem; therefore, a distributed word representation technique is chosen over Bag-of-Words for text representation. Since limited labeled data was available for training, pre-trained word vectors are used and fine-tuned on this classification task. Various deep learning models based on LSTM, Bidirectional LSTM, CNN and Stacked CNN are used for classification. We observed that the labeled data was highly affected by class imbalance and that our technique for handling the imbalance was not effective; in fact, performance degraded in some of the runs. Macro F1 score is used as the primary evaluation metric. Our system achieves a Macro F1 score of 0.7833 in sub-task A, 0.6456 in sub-task B and 0.5533 in sub-task C.
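The abstract above names Macro F1 as the primary metric and reports class imbalance in the data. A minimal sketch of how Macro F1 is commonly computed follows; it is not the shared task's official scorer, and the function name `macro_f1` and the label values are made up for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Per-class F1 is computed as 2*TP / (2*TP + FP + FN); a class with
    no true or predicted instances contributes 0.0 (an assumption).
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up imbalanced example: the classifier always predicts the majority class.
gold = ["NOT", "NOT", "NOT", "OFF"]
pred = ["NOT", "NOT", "NOT", "NOT"]
score = macro_f1(gold, pred)
```

In this toy example accuracy is 0.75, while Macro F1 is much lower because the minority class scores zero. This is why the metric is a natural fit for imbalanced label distributions.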
2018
Filtering Aggression from the Multilingual Social Media Feed
Sandip Modha | Prasenjit Majumder | Thomas Mandl
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)
This paper describes the participation of team DA-LD-Hildesheim from the Information Retrieval Lab (IRLAB) at DA-IICT Gandhinagar, India, in collaboration with the University of Hildesheim, Germany and LDRP-ITR, Gandhinagar, India, in the shared task on Aggression Identification at COLING 2018. The objective of the shared task is to identify the level of aggression in user-generated content on social media written in English, Devanagari Hindi and Romanized Hindi. Aggression levels are categorized into three predefined classes, namely ‘Overtly Aggressive’, ‘Covertly Aggressive’ and ‘Non-aggressive’. The participating teams are required to develop a multi-class classifier which classifies user-generated content into these predefined classes. Instead of relying on a bag-of-words model, we have used pre-trained vectors for word embedding. We have performed experiments with standard machine learning classifiers. In addition, we have developed various deep learning models for the multi-class classification problem. Using the validation data, we found that our deep learning models outperform all standard machine learning classifiers and voting-based ensemble techniques in validation accuracy, and results on test data support these findings. We have also found that the hyper-parameters of the deep neural network are key to improving the results.
2016
DA-IICT Submission for PDTB-styled Discourse Parser
Devanshu Jain | Prasenjit Majumder
Proceedings of the CoNLL-16 shared task
2013
Optimum Parameter Selection for K.L.D. Based Authorship Attribution in Gujarati
Parth Mehta | Prasenjit Majumder
Proceedings of the Sixth International Joint Conference on Natural Language Processing
2011
Soundex-based Translation Correction in Urdu–English Cross-Language Information Retrieval
Manaal Faruqui | Prasenjit Majumder | Sebastian Padó
Proceedings of the Fifth International Workshop On Cross Lingual Information Access