2022
CNLP-NITS-PP at WANLP 2022 Shared Task: Propaganda Detection in Arabic using Data Augmentation and AraBERT Pre-trained Model
Sahinur Rahman Laskar | Rahul Singh | Abdullah Faiz Ur Rahman Khilji | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Online users today are regularly exposed to propagandistic media posts. Several strategies have been developed to promote safer media consumption in Arabic to combat this; however, multilabel annotated social media datasets remain scarce. In this work, we used a pre-trained AraBERT Twitter-base model on training data expanded via data augmentation. Our team, CNLP-NITS-PP, achieved third rank in subtask 1 of the WANLP-2022 shared task on propaganda detection in Arabic, with a micro-F1 score of 0.602.
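A minimal sketch of the kind of setup the abstract describes: fine-tuning a pre-trained AraBERT Twitter-base checkpoint for multilabel classification on augmented training data. The model identifier, label count, and placeholder data are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: multilabel fine-tuning of an AraBERT Twitter-base checkpoint.
# Model name, label count, and data are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "aubmindlab/bert-base-arabertv02-twitter"  # assumed checkpoint
NUM_LABELS = 20  # hypothetical number of propaganda techniques

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

texts = ["..."]                                # augmented training tweets (placeholder)
labels = torch.zeros(len(texts), NUM_LABELS)   # multi-hot label vectors

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()                        # one illustrative training step
```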
2021
CNLP-NITS @ LongSumm 2021: TextRank Variant for Generating Long Summaries
Darsh Kaushik | Abdullah Faiz Ur Rahman Khilji | Utkarsh Sinha | Partha Pakray
Proceedings of the Second Workshop on Scholarly Document Processing
The huge influx of published papers in the field of machine learning makes summarization of scholarly documents vital, not just to eliminate redundancy but also to provide a complete and satisfying crux of the content. We participated in LongSumm 2021: The 2nd Shared Task on Generating Long Summaries for Scientific Documents, where the task is to generate long summaries for scientific papers provided by the organizers. This paper describes our extractive summarization approach to the task. We used the TextRank algorithm with the BM25 score as the similarity function. Despite being a graph-based ranking algorithm that requires no learning, TextRank produced quite decent results with minimal compute power and time. We attained 3rd rank according to ROUGE-1 scores (0.5131 F-measure and 0.5271 recall) and performed decently on ROUGE-2 scores as well.
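A minimal sketch of the extractive idea described above: sentences are graph nodes, BM25 scores supply edge weights, and PageRank ranks the sentences. The library choices (nltk, rank_bm25, networkx) and the top-k selection are assumptions, not necessarily the authors' implementation.

```python
# Sketch: TextRank-style extractive summarization with BM25 as the
# sentence-similarity function. Library choices are assumptions.
import networkx as nx
from rank_bm25 import BM25Okapi
from nltk.tokenize import sent_tokenize, word_tokenize  # requires nltk "punkt" data

def summarize(text: str, top_k: int = 10) -> list:
    sentences = sent_tokenize(text)
    tokenized = [word_tokenize(s.lower()) for s in sentences]
    bm25 = BM25Okapi(tokenized)

    # Edge weight between sentences i and j = BM25 score of j against query i.
    graph = nx.Graph()
    for i, query in enumerate(tokenized):
        scores = bm25.get_scores(query)
        for j, score in enumerate(scores):
            if i != j and score > 0:
                graph.add_edge(i, j, weight=float(score))

    ranks = nx.pagerank(graph, weight="weight")
    best = sorted(ranks, key=ranks.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]  # keep original sentence order
```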
Improved English to Hindi Multimodal Neural Machine Translation
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Darsh Kaushik | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Machine translation performs automatic translation from one natural language to another. Neural machine translation (NMT) is the state-of-the-art approach to machine translation, but it requires adequate training data, which is a severe problem for low-resource language pairs. Multimodal NMT addresses this by merging textual features with visual features to improve low-resource translation. WAT2021 (Workshop on Asian Translation 2021) organized a shared task on multimodal translation for English to Hindi. We participated as team CNLP-NITS-PP with two submissions: multimodal and text-only NMT. This work investigates phrase-pair injection via a data augmentation approach and improves over our previous work at WAT2020 on the same task in both text-only and multimodal NMT. We achieved second rank on the challenge test set for English to Hindi multimodal translation, with a Bilingual Evaluation Understudy (BLEU) score of 39.28, a Rank-based Intuitive Bilingual Evaluation Score (RIBES) of 0.792097, and an Adequacy-Fluency Metrics (AMFM) score of 0.830230.
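A minimal sketch of the general phrase-pair injection idea as data augmentation: phrase pairs (e.g. extracted by an SMT phrase-extraction step) are appended to the sentence-level parallel corpus before NMT training. File names, formats, and the source of the phrase table are assumptions.

```python
# Sketch: "phrase pair injection" as data augmentation -- append phrase
# pairs to the original parallel corpus as extra pseudo-sentence pairs.
# File names and the tab-separated phrase-table format are assumptions.
def inject_phrase_pairs(parallel_src, parallel_tgt, phrase_table, out_src, out_tgt):
    with open(out_src, "w", encoding="utf-8") as fs, \
         open(out_tgt, "w", encoding="utf-8") as ft:
        # Copy the original sentence-level corpus.
        for line in open(parallel_src, encoding="utf-8"):
            fs.write(line)
        for line in open(parallel_tgt, encoding="utf-8"):
            ft.write(line)
        # Append each extracted phrase pair as an additional training pair.
        for line in open(phrase_table, encoding="utf-8"):
            src_phrase, tgt_phrase = line.rstrip("\n").split("\t")[:2]
            fs.write(src_phrase + "\n")
            ft.write(tgt_phrase + "\n")
```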
2020
Multimodal Neural Machine Translation for English to Hindi
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 7th Workshop on Asian Translation
Machine translation (MT) focuses on the automatic translation of text from one natural language to another. Neural machine translation (NMT) achieves state-of-the-art results in machine translation by utilizing advanced deep learning techniques and handling issues such as long-term dependency and context analysis. Nevertheless, NMT still suffers from low translation quality for low-resource languages. To counter this challenge, the multimodal approach combines textual and visual features to improve the translation quality of low-resource languages. Moreover, utilizing monolingual data in the pre-training step can improve system performance for low-resource language translation. The Workshop on Asian Translation 2020 (WAT2020) organized a multimodal translation task for English to Hindi. We participated as team CNLP-NITS with a two-track submission, namely text-only and multimodal translation. The results reported at the WAT2020 translation task show that our multimodal NMT system attained higher scores than our text-only NMT on both the challenge and evaluation test sets. On the challenge test data, our multimodal neural machine translation system achieved a Bilingual Evaluation Understudy (BLEU) score of 33.57, a Rank-based Intuitive Bilingual Evaluation Score (RIBES) of 0.754141, and an Adequacy-Fluency Metrics (AMFM) score of 0.787320; on the evaluation test data, it achieved BLEU, RIBES, and AMFM scores of 40.51, 0.803208, and 0.820980, respectively, for English to Hindi translation.
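A minimal sketch of one simple way to merge textual and visual features, as the abstract describes: project a global image feature and prepend it to the token embeddings as an extra "visual token" before the encoder. The dimensions and fusion strategy are assumptions, not the exact WAT2020 system architecture.

```python
# Sketch: fusing a global image feature with token embeddings for
# multimodal NMT. Dimensions and fusion choice are assumptions.
import torch
import torch.nn as nn

class VisualTextFusion(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)

    def forward(self, tokens, img_feat):
        # tokens: (batch, seq_len) token ids
        # img_feat: (batch, img_dim) pooled CNN features of the paired image
        tok_emb = self.embed(tokens)                    # (batch, seq_len, d_model)
        vis_emb = self.img_proj(img_feat).unsqueeze(1)  # (batch, 1, d_model)
        return torch.cat([vis_emb, tok_emb], dim=1)     # input to the NMT encoder
```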
Hindi-Marathi Cross Lingual Model
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Fifth Conference on Machine Translation
Machine Translation (MT) is a vital tool for aiding communication between linguistically separate groups of people. Neural machine translation (NMT) based approaches have gained widespread acceptance because of their outstanding performance. We participated in the WMT20 shared task on similar language translation for the Hindi-Marathi pair. The main challenge of this task is to utilize monolingual data and the similarity features of the language pair to overcome the limited availability of parallel data. In this work, we implemented an NMT-based model that simultaneously learns bilingual embeddings from both the source and target languages. Our model achieved, for Hindi to Marathi, a bilingual evaluation understudy (BLEU) score of 11.59, a rank-based intuitive bilingual evaluation score (RIBES) of 57.76, and a translation edit rate (TER) of 79.07, and for Marathi to Hindi, a BLEU score of 15.44, a RIBES score of 61.13, and a TER of 75.96.
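One common way to realize a shared bilingual embedding space for a similar-language pair such as Hindi-Marathi is to train a single subword model on concatenated text from both sides, so that one vocabulary (and hence one tied embedding matrix) serves both languages. The sketch below uses sentencepiece; the tool choice, file names, and settings are assumptions, not the authors' exact method.

```python
# Sketch: joint subword vocabulary over Hindi + Marathi text so the NMT
# model can share one embedding space. Tool and settings are assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.hi,train.mr",      # hypothetical Hindi and Marathi text files
    model_prefix="hi_mr_joint",
    vocab_size=16000,
    character_coverage=1.0,         # both languages use the Devanagari script
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="hi_mr_joint.model")
print(sp.encode("यह एक उदाहरण वाक्य है", out_type=str))
```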
Human Behavior Assessment using Ensemble Models
Abdullah Faiz Ur Rahman Khilji | Rituparna Khaund | Utkarsh Sinha
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association
Behavioral analysis is a pertinent step in today's automated age. It is important to judge a statement on a variety of parameters before reaching a valid conclusion. In today's world of technology and automation, natural language processing tools have benefited from growing access to data for analyzing context and scenario. A better understanding of human behaviors would empower a range of automated tools to provide users a customized experience, so behavior understanding is important for precise analysis. We experimented with various machine learning techniques and obtained a maximum private score of 0.1033 with a public score of 0.1733. The methods are described as part of the ALTA 2020 shared task. In this work, we report our results and the challenges faced in solving the problem of human behavior assessment.
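A minimal sketch of the kind of ensemble of classical text classifiers the title suggests: TF-IDF features with soft voting over several estimators. The feature extraction and the specific classifiers are illustrative assumptions, not the authors' exact models.

```python
# Sketch: a simple text-classification ensemble -- TF-IDF features with
# soft voting over several classifiers. Choices are assumptions.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=300)),
            ("svc", SVC(probability=True)),  # probability=True enables soft voting
        ],
        voting="soft",
    ),
)
# Usage: ensemble.fit(train_texts, train_labels); ensemble.predict(test_texts)
```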
Zero-Shot Neural Machine Translation: Russian-Hindi @LoResMT 2020
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Neural machine translation (NMT) is a widely accepted approach in the machine translation (MT) community for translating from one natural language to another. Although NMT shows remarkable performance for both high- and low-resource languages, it needs a sufficient training corpus. The limited availability of parallel corpora for low-resource language pairs is one of the main challenges in MT. To mitigate this issue, NMT attempts to utilize monolingual corpora to improve translation for low-resource language pairs. The Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020) organized shared tasks on low-resource language pair translation using zero-shot NMT, where no parallel corpus is used and only monolingual corpora are allowed. We participated in this shared task as team CNLP-NITS for the Russian-Hindi language pair. We used masked sequence-to-sequence pre-training for language generation (MASS) with only monolingual corpora, following the unsupervised NMT architecture. As reported in the LoResMT 2020 shared task results, our system achieved a bilingual evaluation understudy (BLEU) score of 0.59, a precision score of 3.43, a recall score of 5.48, an F-measure score of 4.22, and a rank-based intuitive bilingual evaluation score (RIBES) of 0.180147 for Russian to Hindi translation. For Hindi to Russian translation, we achieved BLEU, precision, recall, F-measure, and RIBES scores of 1.11, 4.72, 4.41, 4.56, and 0.026842, respectively.
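A minimal sketch of the MASS-style objective the abstract mentions: mask a contiguous span of the input sentence and train the decoder to reconstruct only the masked tokens from monolingual data. The span fraction, mask token, and example sentence are assumptions.

```python
# Sketch of a MASS-style training example: mask a contiguous span in the
# encoder input; the decoder predicts only the masked span.
# Span fraction and mask token are assumptions.
import random

def make_mass_example(tokens, mask_token="<mask>", span_frac=0.5):
    n = len(tokens)
    span_len = max(1, int(n * span_frac))
    start = random.randint(0, n - span_len)

    encoder_input = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    decoder_target = tokens[start:start + span_len]  # predict only the masked span
    return encoder_input, decoder_target

sentence = "यह एक छोटा उदाहरण वाक्य है".split()
enc_in, dec_out = make_mass_example(sentence)
print(enc_in, dec_out)
```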
EnAsCorp1.0: English-Assamese Corpus
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Corpus preparation is one of the most challenging tasks in machine translation, especially in low-resource language scenarios. In a country like India, where multiple languages exist, machine translation attempts to minimize the communication gap among people with different linguistic backgrounds. Although Google Translate covers automatic translation for many languages all over the world, it lags behind for some languages, including Assamese. In this paper, we present EnAsCorp1.0, a corpus for the low-resource English-Assamese pair, with parallel and monolingual data collected from various online sources. We have also implemented baseline systems with statistical machine translation and neural machine translation approaches on this corpus.
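A minimal sketch of the kind of cleaning step that preparing a parallel corpus from varied online sources typically involves: dropping duplicate pairs and pairs with extreme length ratios. The thresholds and file layout are assumptions, not the authors' pipeline.

```python
# Sketch: basic parallel-corpus cleaning -- deduplicate sentence pairs
# and filter extreme length ratios. Thresholds/format are assumptions.
def clean_parallel(src_path, tgt_path, max_ratio=3.0):
    seen, pairs = set(), []
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        for src, tgt in zip(fs, ft):
            src, tgt = src.strip(), tgt.strip()
            if not src or not tgt or (src, tgt) in seen:
                continue  # skip empty lines and duplicate pairs
            src_len, tgt_len = len(src.split()), len(tgt.split())
            ratio = max(src_len, tgt_len) / max(1, min(src_len, tgt_len))
            if ratio > max_ratio:
                continue  # likely misaligned pair
            seen.add((src, tgt))
            pairs.append((src, tgt))
    return pairs
```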