Sahinur Rahman Laskar


2023

pdf
Findings of the WMT 2023 Shared Task on Low-Resource Indic Language Translation
Santanu Pal | Partha Pakray | Sahinur Rahman Laskar | Lenin Laitonjam | Vanlalmuansangi Khenglawt | Sunita Warjri | Pankaj Kundan Dadure | Sandeep Kumar Dash
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the low-resource Indic language translation task organized alongside the Eighth Conference on Machine Translation (WMT) 2023. In this task, participants were asked to build machine translation systems for any of four language pairs, namely, English-Assamese, English-Mizo, English-Khasi, and English-Manipuri. For this task, the IndicNE-Corp1.0 dataset is released, which consists of parallel and monolingual corpora for northeastern Indic languages such as Assamese, Mizo, Khasi, and Manipuri. The evaluation will be carried out using automatic evaluation metrics (BLEU, TER, RIBES, COMET, ChrF) and human evaluation.

2022

pdf
Investigation of Multilingual Neural Machine Translation for Indian Languages
Sahinur Rahman Laskar | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 9th Workshop on Asian Translation

In the domain of natural language processing, machine translation is a well-defined task where one natural language is automatically translated to another natural language. The deep learning-based approach of machine translation, known as neural machine translation attains remarkable translational performance. However, it requires a sufficient amount of training data which is a critical issue for low-resource pair translation. To handle the data scarcity problem, the multilingual concept has been investigated in neural machine translation in different settings like many-to-one and one-to-many translation. WAT2022 (Workshop on Asian Translation 2022) organizes (hosted by the COLING 2022) Indic tasks: English-to-Indic and Indic-to-English translation tasks where we have participated as a team named CNLP-NITS-PP. Herein, we have investigated a transliteration-based approach, where Indic languages are transliterated into English script and shared sub-word level vocabulary during the training phase. We have attained BLEU scores of 2.0 (English-to-Bengali), 1.10 (English-to-Assamese), 4.50 (Bengali-to-English), and 3.50 (Assamese-to-English) translation, respectively.

pdf
English to Bengali Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation
Sahinur Rahman Laskar | Pankaj Dadure | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 9th Workshop on Asian Translation

Automatic translation of one natural language to another is a popular task of natural language processing. Although the deep learning-based technique known as neural machine translation (NMT) is a widely accepted machine translation approach, it needs an adequate amount of training data, which is a challenging issue for low-resource pair translation. Moreover, the multimodal concept utilizes text and visual features to improve low-resource pair translation. WAT2022 (Workshop on Asian Translation 2022) organizes (hosted by the COLING 2022) English to Bengali multimodal translation task where we have participated as a team named CNLP-NITS-PP in two tracks: 1) text-only and 2) multimodal translation. Herein, we have proposed a transliteration-based phrase pairs augmentation approach which shows improvement in the multimodal translation task and achieved benchmark results on Bengali Visual Genome 1.0 dataset. We have attained the best results on the challenge and evaluation test set for English to Bengali multimodal translation with BLEU scores of 28.70, 43.90 and RIBES scores of 0.688931, 0.780669, respectively.

pdf
Investigation of English to Hindi Multimodal Neural Machine Translation using Transliteration-based Phrase Pairs Augmentation
Sahinur Rahman Laskar | Rahul Singh | Md Faizal Karim | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 9th Workshop on Asian Translation

Machine translation translates one natural language to another, a well-defined natural language processing task. Neural machine translation (NMT) is a widely accepted machine translation approach, but it requires a sufficient amount of training data, which is a challenging issue for low-resource pair translation. Moreover, the multimodal concept utilizes text and visual features to improve low-resource pair translation. WAT2022 (Workshop on Asian Translation 2022) organizes (hosted by the COLING 2022) English to Hindi multimodal translation task where we have participated as a team named CNLP-NITS-PP in two tracks: 1) text-only and 2) multimodal translation. Herein, we have proposed a transliteration-based phrase pairs augmentation approach, which shows improvement in the multimodal translation task. We have attained the second best results on the challenge test set for English to Hindi multimodal translation with BLEU score of 39.30, and a RIBES score of 0.791468.

pdf
Language Resource Building and English-to-Mizo Neural Machine Translation Encountering Tonal Words
Vanlalmuansangi Khenglawt | Sahinur Rahman Laskar | Santanu Pal | Partha Pakray | Ajoy Kumar Khan
Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference

Multilingual country like India has an enormous linguistic diversity and has an increasing demand towards developing language resources such that it will outreach in various natural language processing applications like machine translation. Low-resource language translation possesses challenges in the field of machine translation. The challenges include the availability of corpus and differences in linguistic information. This paper investigates a low-resource language pair, English-to-Mizo exploring neural machine translation by contributing an Indian language resource, i.e., English-Mizo corpus. In this work, we explore one of the main challenges to tackling tonal words existing in the Mizo language, as they add to the complexity on top of low-resource challenges for any natural language processing task. Our approach improves translation accuracy by encountering tonal words of Mizo and achieved a state-of-the-art result in English-to-Mizo translation.

pdf
CNLP-NITS-PP at WANLP 2022 Shared Task: Propaganda Detection in Arabic using Data Augmentation and AraBERT Pre-trained Model
Sahinur Rahman Laskar | Rahul Singh | Abdullah Faiz Ur Rahman Khilji | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

In today’s time, online users are regularly exposed to media posts that are propagandistic. Several strategies have been developed to promote safer media consumption in Arabic to combat this. However, there is a limited available multilabel annotated social media dataset. In this work, we have used a pre-trained AraBERT twitter-base model on an expanded train data via data augmentation. Our team CNLP-NITS-PP, has achieved the third rank in subtask 1 at WANLP-2022, for propaganda detection in Arabic (shared task) in terms of micro-F1 score of 0.602.

pdf
CNLP-NITS-PP at MixMT 2022: Hinglish-English Code-Mixed Machine Translation
Sahinur Rahman Laskar | Rahul Singh | Shyambabu Pandey | Riyanka Manna | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Seventh Conference on Machine Translation (WMT)

The mixing of two or more languages in speech or text is known as code-mixing. In this form of communication, users mix words and phrases from multiple languages. Code-mixing is very common in the context of Indian languages due to the presence of multilingual societies. The probability of the existence of code-mixed sentences in almost all Indian languages since in India English is the dominant language for social media textual communication platforms. We have participated in the WMT22 shared task of code-mixed machine translation with the team name: CNLP-NITS-PP. In this task, we have prepared a synthetic Hinglish–English parallel corpus using transliteration of original Hindi sentences to tackle the limitation of the parallel corpus, where, we mainly considered sentences that have named-entity (proper noun) from the available English-Hindi parallel corpus. With the addition of synthetic bi-text data to the original parallel corpus (train set), our transformer-based neural machine translation models have attained recall-oriented understudy for gisting evaluation (ROUGE-L) scores of 0.23815, 0.33729, and word error rate (WER) scores of 0.95458, 0.88451 at Sub-Task-1 (English-to-Hinglish) and Sub-Task-2 (Hinglish-to-English) for test set results respectively.

2021

pdf
EnKhCorp1.0: An English–Khasi Corpus
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji Darsh Kaushik | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

In machine translation, corpus preparation is one of the crucial tasks, particularly for lowresource pairs. In multilingual countries like India, machine translation plays a vital role in communication among people with various linguistic backgrounds. There are available online automatic translation systems by Google and Microsoft which include various languages which lack support for the Khasi language, which can hence be considered lowresource. This paper overviews the development of EnKhCorp1.0, a corpus for English–Khasi pair, and implemented baseline systems for EnglishtoKhasi and KhasitoEnglish translation based on the neural machine translation approach.

pdf
Improved English to Hindi Multimodal Neural Machine Translation
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Darsh Kaushik | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

Machine translation performs automatic translation from one natural language to another. Neural machine translation attains a state-of-the-art approach in machine translation, but it requires adequate training data, which is a severe problem for low-resource language pairs translation. The concept of multimodal is introduced in neural machine translation (NMT) by merging textual features with visual features to improve low-resource pair translation. WAT2021 (Workshop on Asian Translation 2021) organizes a shared task of multimodal translation for English to Hindi. We have participated the same with team name CNLP-NITS-PP in two submissions: multimodal and text-only NMT. This work investigates phrase pairs injection via data augmentation approach and attains improvement over our previous work at WAT2020 on the same task in both text-only and multimodal NMT. We have achieved second rank on the challenge test set for English to Hindi multimodal translation where Bilingual Evaluation Understudy (BLEU) score of 39.28, Rank-based Intuitive Bilingual Evaluation Score (RIBES) 0.792097, and Adequacy-Fluency Metrics (AMFM) score 0.830230 respectively.

pdf
Neural Machine Translation for Tamil–Telugu Pair
Sahinur Rahman Laskar | Bishwaraj Paul | Prottay Kumar Adhikary | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Sixth Conference on Machine Translation

The neural machine translation approach has gained popularity in machine translation because of its context analysing ability and its handling of long-term dependency issues. We have participated in the WMT21 shared task of similar language translation on a Tamil-Telugu pair with the team name: CNLP-NITS. In this task, we utilized monolingual data via pre-train word embeddings in transformer model based neural machine translation to tackle the limitation of parallel corpus. Our model has achieved a bilingual evaluation understudy (BLEU) score of 4.05, rank-based intuitive bilingual evaluation score (RIBES) score of 24.80 and translation edit rate (TER) score of 97.24 for both Tamil-to-Telugu and Telugu-to-Tamil translations respectively.

2020

pdf
Multimodal Neural Machine Translation for English to Hindi
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 7th Workshop on Asian Translation

Machine translation (MT) focuses on the automatic translation of text from one natural language to another natural language. Neural machine translation (NMT) achieves state-of-the-art results in the task of machine translation because of utilizing advanced deep learning techniques and handles issues like long-term dependency, and context-analysis. Nevertheless, NMT still suffers low translation quality for low resource languages. To encounter this challenge, the multi-modal concept comes in. The multi-modal concept combines textual and visual features to improve the translation quality of low resource languages. Moreover, the utilization of monolingual data in the pre-training step can improve the performance of the system for low resource language translations. Workshop on Asian Translation 2020 (WAT2020) organized a translation task for multimodal translation in English to Hindi. We have participated in the same in two-track submission, namely text-only and multi-modal translation with team name CNLP-NITS. The evaluated results are declared at the WAT2020 translation task, which reports that our multi-modal NMT system attained higher scores than our text-only NMT on both challenge and evaluation test set. For the challenge test data, our multi-modal neural machine translation system achieves Bilingual Evaluation Understudy (BLEU) score of 33.57, Rank-based Intuitive Bilingual Evaluation Score (RIBES) 0.754141, Adequacy-Fluency Metrics (AMFM) score 0.787320 and for evaluation test data, BLEU, RIBES, and, AMFM score of 40.51, 0.803208, and 0.820980 for English to Hindi translation respectively.

pdf
Zero-Shot Neural Machine Translation: Russian-Hindi @LoResMT 2020
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Neural machine translation (NMT) is a widely accepted approach in the machine translation (MT) community, translating from one natural language to another natural language. Although, NMT shows remarkable performance in both high and low resource languages, it needs sufficient training corpus. The availability of a parallel corpus in low resource language pairs is one of the challenging tasks in MT. To mitigate this issue, NMT attempts to utilize a monolingual corpus to get better at translation for low resource language pairs. Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020) organized shared tasks of low resource language pair translation using zero-shot NMT. Here, the parallel corpus is not used and only monolingual corpora is allowed. We have participated in the same shared task with our team name CNLP-NITS for the Russian-Hindi language pair. We have used masked sequence to sequence pre-training for language generation (MASS) with only monolingual corpus following the unsupervised NMT architecture. The evaluated results are declared at the LoResMT 2020 shared task, which reports that our system achieves the bilingual evaluation understudy (BLEU) score of 0.59, precision score of 3.43, recall score of 5.48, F-measure score of 4.22, and rank-based intuitive bilingual evaluation score (RIBES) of 0.180147 in Russian to Hindi translation. And for Hindi to Russian translation, we have achieved BLEU, precision, recall, F-measure, and RIBES score of 1.11, 4.72, 4.41, 4.56, and 0.026842 respectively.

pdf
EnAsCorp1.0: English-Assamese Corpus
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

The corpus preparation is one of the important challenging task for the domain of machine translation especially in low resource language scenarios. Country like India where multiple languages exists, machine translation attempts to minimize the communication gap among people with different linguistic backgrounds. Although Google Translation covers automatic translation of various languages all over the world but it lags in some languages including Assamese. In this paper, we have developed EnAsCorp1.0, corpus of English-Assamese low resource pair where parallel and monolingual data are collected from various online sources. We have also implemented baseline systems with statistical machine translation and neural machine translation approaches for the same corpus.

pdf
Hindi-Marathi Cross Lingual Model
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Fifth Conference on Machine Translation

Machine Translation (MT) is a vital tool for aiding communication between linguistically separate groups of people. The neural machine translation (NMT) based approaches have gained widespread acceptance because of its outstanding performance. We have participated in WMT20 shared task of similar language translation on Hindi-Marathi pair. The main challenge of this task is by utilization of monolingual data and similarity features of similar language pair to overcome the limitation of available parallel data. In this work, we have implemented NMT based model that simultaneously learns bilingual embedding from both the source and target language pairs. Our model has achieved Hindi to Marathi bilingual evaluation understudy (BLEU) score of 11.59, rank-based intuitive bilingual evaluation score (RIBES) score of 57.76 and translation edit rate (TER) score of 79.07 and Marathi to Hindi BLEU score of 15.44, RIBES score of 61.13 and TER score of 75.96.

2019

pdf
Neural Machine Translation: Hindi-Nepali
Sahinur Rahman Laskar | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

With the extensive use of Machine Translation (MT) technology, there is progressively interest in directly translating between pairs of similar languages. Because the main challenge is to overcome the limitation of available parallel data to produce a precise MT output. Current work relies on the Neural Machine Translation (NMT) with attention mechanism for the similar language translation of WMT19 shared task in the context of Hindi-Nepali pair. The NMT systems trained the Hindi-Nepali parallel corpus and tested, analyzed in Hindi ⇔ Nepali translation. The official result declared at WMT19 shared task, which shows that our NMT system obtained Bilingual Evaluation Understudy (BLEU) score 24.6 for primary configuration in Nepali to Hindi translation. Also, we have achieved BLEU score 53.7 (Hindi to Nepali) and 49.1 (Nepali to Hindi) in contrastive system type.

pdf
English to Hindi Multi-modal Neural Machine Translation and Hindi Image Captioning
Sahinur Rahman Laskar | Rohit Pratap Singh | Partha Pakray | Sivaji Bandyopadhyay
Proceedings of the 6th Workshop on Asian Translation

With the widespread use of Machine Trans-lation (MT) techniques, attempt to minimizecommunication gap among people from di-verse linguistic backgrounds. We have par-ticipated in Workshop on Asian Transla-tion 2019 (WAT2019) multi-modal translationtask. There are three types of submissiontrack namely, multi-modal translation, Hindi-only image captioning and text-only transla-tion for English to Hindi translation. The mainchallenge is to provide a precise MT output. The multi-modal concept incorporates textualand visual features in the translation task. Inthis work, multi-modal translation track re-lies on pre-trained convolutional neural net-works (CNN) with Visual Geometry Grouphaving 19 layered (VGG19) to extract imagefeatures and attention-based Neural MachineTranslation (NMT) system for translation. The merge-model of recurrent neural network(RNN) and CNN is used for the Hindi-onlyimage captioning. The text-only translationtrack is based on the transformer model of theNMT system. The official results evaluated atWAT2019 translation task, which shows thatour multi-modal NMT system achieved Bilin-gual Evaluation Understudy (BLEU) score20.37, Rank-based Intuitive Bilingual Eval-uation Score (RIBES) 0.642838, Adequacy-Fluency Metrics (AMFM) score 0.668260 forchallenge test data and BLEU score 40.55,RIBES 0.760080, AMFM score 0.770860 forevaluation test data in English to Hindi multi-modal translation respectively.