Thoudam Doren Singh

Also published as: Thoudam Doren Singh

2023

pdf abs
Sentiment Analysis for the Mizo Language: A Comparative Study of Classical Machine Learning and Transfer Learning Approaches
Mercy Lalthangmawii | Thoudam Doren Singh
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Sentiment analysis, a subfield of natural language processing (NLP) has witnessed significant advancements in the analysis of usergenerated contents across diverse languages. However, its application to low-resource languages remains a challenge. This research addresses this gap by conducting a comprehensive sentiment analysis experiment in the context of the Mizo language, a low-resource language predominantly spoken in the Indian state of Mizoram and neighboring regions. Our study encompasses the evaluation of various machine learning models including Support Vector Machine (SVM), Decision Tree, Random Forest, K-Nearest Neighbor (K-NN), Logistic Regression and transfer learning using XLM-RoBERTa. The findings reveal the suitability of SVM as a robust performer in Mizo sentiment analysis demonstrating the highest F1 Score and Accuracy among the models tested. XLM-RoBERTa, a transfer learning model exhibits competitive performance highlighting the potential of leveraging pre-trained multilingual models in low-resource language sentiment analysis tasks. This research advances our understanding of sentiment analysis in lowresource languages and serves as a stepping stone for future investigations in this domain.

pdf abs
A comparative study of transformer and transfer learning MT models for English-Manipuri
Kshetrimayum Boynao Singh | Ningthoujam Avichandra Singh | Loitongbam Sanayai Meetei | Ningthoujam Justwant Singh | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

In this work, we focus on the development of machine translation (MT) models of a lowresource language pair viz. English-Manipuri. Manipuri is one of the eight scheduled languages of the Indian constitution. Manipuri is currently written in two different scripts: one is its original script called Meitei Mayek and the other is the Bengali script. We evaluate the performance of English-Manipuri MT models based on transformer and transfer learning technique. Our MT models are trained using a dataset of 69,065 parallel sentences and validated on 500 sentences. Using 500 test sentences, the English to Manipuri MT models achieved a BLEU score of 19.13 and 29.05 with mT5 and OpenNMT respectively. The results demonstrate that the OpenNMT model significantly outperforms the mT5 model. Additionally, Manipuri to English MT system trained with OpenNMT model reported a BLEU score of 30.90. We also carried out a comparative analysis between the Bengali script and the transliterated Meitei Mayek script for English-Manipuri MT models. This analysis reveals that the transliterated version enhances the MT model performance resulting in a notable +2.35 improvement in the BLEU score.

pdf abs
NITS-CNLP Low-Resource Neural Machine Translation Systems of English-Manipuri Language Pair
Kshetrimayum Boynao Singh | Avichandra Singh Ningthoujam | Loitongbam Sanayai Meetei | Sivaji Bandyopadhyay | Thoudam Doren Singh
Proceedings of the Eighth Conference on Machine Translation

This paper describes the transformer-based Neural Machine translation (NMT) system for the Low-Resource Indic Language Translation task for the English-Manipuri language pair submitted by the Centre for Natural Language Processing in National Institute of Technology Silchar, India (NITS-CNLP) in the WMT 2023 shared task. The model attained an overall BLEU score of 22.75 and 26.92 for the English to Manipuri and Manipuri to English translations respectively. Experimental results for English to Manipuri and Manipuri to English models for character level n-gram F-score (chrF) of 48.35 and 48.64, RIBES of 0.61 and 0.65, TER of 70.02 and 67.62, as well as COMET of 0.70 and 0.66 respectively are reported.

2021

pdf abs
An Experiment on Speech-to-Text Translation Systems for Manipuri to English on Low Resource Setting
Loitongbam Sanayai Meetei | Laishram Rahul | Alok Singh | Salam Michael Singh | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

In this paper, we report the experimental findings of building Speech-to-Text translation systems for Manipuri-English on low resource setting which is first of its kind in this language pair. For this purpose, a new dataset consisting of a Manipuri-English parallel corpus along with the corresponding audio version of the Manipuri text is built. Based on this dataset, a benchmark evaluation is reported for the Manipuri-English Speech-to-Text translation using two approaches: 1) a pipeline model consisting of ASR (Automatic Speech Recognition) and Machine translation, and 2) an end-to-end Speech-to-Text translation. Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) and Time delay neural network (TDNN) Acoustic models are used to build two different pipeline systems using a shared MT system. Experimental result shows that the TDNN model outperforms GMM-HMM model significantly by a margin of 2.53% WER. However, their evaluation of Speech-to-Text translation differs by a small margin of 0.1 BLEU. Both the pipeline translation models outperform the end-to-end translation model by a margin of 2.6 BLEU score.

pdf abs
On the Transferability of Massively Multilingual Pretrained Models in the Pretext of the Indo-Aryan and Tibeto-Burman Languages
Salam Michael Singh | Loitongbam Sanayai Meetei | Alok Singh | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

In recent times, machine translation models can learn to perform implicit bridging between language pairs never seen explicitly during training and showing that transfer learning helps for languages with constrained resources. This work investigates the low resource machine translation via transfer learning from multilingual pre-trained models i.e. mBART-50 and mT5-base in the pretext of Indo-Aryan (Assamese and Bengali) and Tibeto-Burman (Manipuri) languages via finetuning as a downstream task. Assamese and Manipuri were absent in the pretraining of both mBART-50 and the mT5 models. However, the experimental results attest that the finetuning from these pre-trained models surpasses the multilingual model trained from scratch.

pdf abs
Image Caption Generation Framework for Assamese News using Attention Mechanism
Ringki Das | Thoudam Doren Singh
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Automatic caption generation is an artificial intelligence problem that falls at the intersection of computer vision and natural language processing. Although significant works have been reported in image captioning, the contribution is limited to English and few major languages with sufficient resources. But, no work on image captioning has been reported in a resource-constrained language like Assamese. With this inspiration, we propose an encoder-decoder based framework for image caption generation in the Assamese news domain. The VGG-16 pre-trained model at the encoder side and LSTM with an attention mechanism are employed at the decoder side to generate the Assamese caption. We train the proposed model on the dataset built in-house consisting of 10,000 images with a single caption for each image. We describe our experimental methodology, quantitative and qualitative results which validate the effectiveness of our model for caption generation. The proposed model shows a BLEU score of 12.1 outperforming the baseline model.

pdf abs
An Efficient Keyframes Selection Based Framework for Video Captioning
Alok Singh | Loitongbam Sanayai Meetei | Salam Michael Singh | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Describing a video is a challenging yet attractive task since it falls into the intersection of computer vision and natural language generation. The attention-based models have reported the best performance. However, all these models follow similar procedures, such as segmenting videos into chunks of frames or sampling frames at equal intervals for visual encoding. The process of segmenting video into chunks or sampling frames at equal intervals causes encoding of redundant visual information and requires additional computational cost since a video consists of a sequence of similar frames and suffers from inescapable noise such as uneven illumination, occlusion and motion effects. In this paper, a boundary-based keyframes selection approach for video description is proposed that allow the system to select a compact subset of keyframes to encode the visual information and generate a description for a video without much degradation. The proposed approach uses 3 4 frames per video and yields competitive performance over two benchmark datasets MSVD and MSR-VTT (in both English and Hindi).

pdf bib
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)
Thoudam Doren Singh | Cristina España i Bonet | Sivaji Bandyopadhyay | Josef van Genabith
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

pdf bib abs
Multiple Captions Embellished Multilingual Multi-Modal Neural Machine Translation
Salam Michael Singh | Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

Neural machine translation based on bilingual text with limited training data suffers from lexical diversity, which lowers the rare word translation accuracy and reduces the generalizability of the translation system. In this work, we utilise the multiple captions from the Multi-30K dataset to increase the lexical diversity aided with the cross-lingual transfer of information among the languages in a multilingual setup. In this multilingual and multimodal setting, the inclusion of the visual features boosts the translation quality by a significant margin. Empirical study affirms that our proposed multimodal approach achieves substantial gain in terms of the automatic score and shows robustness in handling the rare word translation in the pretext of English to/from Hindi and Telugu translation tasks.

pdf abs
Low Resource Multimodal Neural Machine Translation of English-Hindi in News Domain
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

Incorporating multiple input modalities in a machine translation (MT) system is gaining popularity among MT researchers. Unlike the publicly available dataset for Multimodal Machine Translation (MMT) tasks, where the captions are short image descriptions, the news captions provide a more detailed description of the contents of the images. As a result, numerous named entities relating to specific persons, locations, etc., are found. In this paper, we acquire two monolingual news datasets reported in English and Hindi paired with the images to generate a synthetic English-Hindi parallel corpus. The parallel corpus is used to train the English-Hindi Neural Machine Translation (NMT) and an English-Hindi MMT system by incorporating the image feature paired with the corresponding parallel corpus. We also conduct a systematic analysis to evaluate the English-Hindi MT systems with 1) more synthetic data and 2) by adding back-translated data. Our finding shows improvement in terms of BLEU scores for both the NMT (+8.05) and MMT (+11.03) systems.

2020

pdf abs
The NITS-CNLP System for the Unsupervised MT Task at WMT 2020
Salam Michael Singh | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the Fifth Conference on Machine Translation

We describe NITS-CNLP’s submission to WMT 2020 unsupervised machine translation shared task for German language (de) to Upper Sorbian (hsb) in a constrained setting i.e, using only the data provided by the organizers. We train our unsupervised model using monolingual data from both the languages by jointly pre-training the encoder and decoder and fine-tune using backtranslation loss. The final model uses the source side (de) monolingual data and the target side (hsb) synthetic data as a pseudo-parallel data to train a pseudo-supervised system which is tuned using the provided development set(dev set).

pdf abs
Unsupervised Neural Machine Translation for English and Manipuri
Salam Michael Singh | Thoudam Doren Singh
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Availability of bitext dataset has been a key challenge in the conventional machine translation system which requires surplus amount of parallel data. In this work, we devise an unsupervised neural machine translation (UNMT) system consisting of a transformer based shared encoder and language specific decoders using denoising autoencoder and backtranslation with an additional Manipuri side multiple test reference. We report our work on low resource setting for English (en) - Manipuri (mni) language pair and attain a BLEU score of 3.1 for en-mni and 2.7 for mni-en respectively. Subjective evaluation on translated output gives encouraging findings.

pdf abs
NITS-Hinglish-SentiMix at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text Using an Ensemble Model
Subhra Jyoti Baroi | Nivedita Singh | Ringki Das | Thoudam Doren Singh
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Sentiment Analysis refers to the process of interpreting what a sentence emotes and classifying them as positive, negative, or neutral. The widespread popularity of social media has led to the generation of a lot of text data and specifically, in the Indian social media scenario, the code-mixed Hinglish text i.e, the words of Hindi language, written in the Roman script along with other English words is a common sight. The ability to effectively understand the sentiments in these texts is much needed. This paper proposes a system titled NITS-Hinglish to effectively carry out the sentiment analysis of such code-mixed Hinglish text. The system has fared well with a final F-Score of 0.617 on the test data.

pdf abs
English to Manipuri and Mizo Post-Editing Effort and its Impact on Low Resource Machine Translation
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay | Mihaela Vela | Josef van Genabith
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

We present the first study on the post-editing (PE) effort required to build a parallel dataset for English-Manipuri and English-Mizo, in the context of a project on creating data for machine translation (MT). English source text from a local daily newspaper are machine translated into Manipuri and Mizo using PBSMT systems built in-house. A Computer Assisted Translation (CAT) tool is used to record the time, keystroke and other indicators to measure PE effort in terms of temporal and technical effort. A positive correlation between the technical effort and the number of function words is seen for English-Manipuri and English-Mizo but a negative correlation between the technical effort and the number of noun words for English-Mizo. However, average time spent per token in PE English-Mizo text is negatively correlated with the temporal effort. The main reason for these results are due to (i) English and Mizo using the same script, while Manipuri uses a different script and (ii) the agglutinative nature of Manipuri. Further, we check the impact of training a MT system in an incremental approach, by including the post-edited dataset as additional training data. The result shows an increase in HBLEU of up to 4.6 for English-Manipuri.

2019

pdf abs
WAT2019: English-Hindi Translation on Hindi Visual Genome Dataset
Loitongbam Sanayai Meetei | Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 6th Workshop on Asian Translation

A multimodal translation is a task of translating a source language to a target language with the help of a parallel text corpus paired with images that represent the contextual details of the text. In this paper, we carried out an extensive comparison to evaluate the benefits of using a multimodal approach on translating text in English to a low resource language, Hindi as a part of WAT2019 shared task. We carried out the translation of English to Hindi in three separate tasks with both the evaluation and challenge dataset. First, by using only the parallel text corpora, then through an image caption generation approach and, finally with the multimodal approach. Our experiment shows a significant improvement in the result with the multimodal approach than the other approach.

The performance of an SMT system heavily depends on the availability of large parallel corpora. Unavailability of these resources in the required amount for many language pair is a challenging issue. The required size of the resource involving morphologically rich and highly agglutinative language is essentially much more for the SMT systems. This paper investigates on some of the issues on enriching the resource for this kind of languages. Handling of inflectional and derivational morphemes of the morphologically rich target language plays important role in the enrichment process. Mapping from the source to the target side is carried out for the English-Manipuri SMT task using factored model. The SMT system developed shows improvement in the performance both in terms of the automatic scoring and subjective evaluation over the baseline system.

2011

pdf
Integration of Reduplicated Multiword Expressions and Named Entities in a Phrase Based Statistical Machine Translation System
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib abs
Statistical Machine Translation of English-Manipuri using Morpho-syntactic and Semantic Information
Thoudam Doren Singh | Savaji Bandyopadhyay
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

English-Manipuri language pair is one of the rarely investigated with restricted bilingual resources. The development of a factored Statistical Machine Translation (SMT) system between English as source and Manipuri, a morphologically rich language as target is reported. The role of the suffixes and dependency relations on the source side and case markers on the target side are identified as important translation factors. The morphology and dependency relations play important roles to improve the translation quality. A parallel corpus of 10350 sentences from news domain is used for training and the system is tested with 500 sentences. Using the proposed translation factors, the output of the translation quality is improved as indicated by the BLEU score and subjective evaluation.

pdf
Web Based Manipuri Corpus for Multiword NER and Reduplicated MWEs Identification using SVM
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing

pdf
Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations
Thoudam Doren Singh | Sivaji Bandyopadhyay
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation