Sudip Kumar Naskar

Also published as: Sudip Kumar Naskar, Sudip Naskar

2024

Fine-tuning Language Models for Predicting the Impact of Events Associated to Financial News Articles
Neelabha Banerjee | Anubhav Sarkar | Swagata Chakraborty | Sohom Ghosh | Sudip Kumar Naskar
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing

Investors and other stakeholders, such as consumers and employees, increasingly consider ESG factors when making decisions about investments or engaging with companies. Taking into account the importance of ESG today, FinNLP-KDF introduced the ML-ESG-3 shared task, which seeks to determine the duration of the impact of financial news articles in four languages: English, French, Korean, and Japanese. This paper describes the approach of our team, LIPI, to solving this task. Our final systems consist of translation, paraphrasing, and fine-tuning language models like BERT, Fin-BERT and RoBERTa for classification. We ranked first in the impact duration prediction subtask for the French language.

IndicFinNLP: Financial Natural Language Processing for Indian Languages
Sohom Ghosh | Arnab Maji | Aswartha Narayana | Sudip Kumar Naskar
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Applications of Natural Language Processing (NLP) in the finance domain have been very popular of late. For financial NLP (FinNLP), while various datasets exist for widely spoken languages like English and Chinese, datasets are scarce for low-resource languages, particularly for Indian languages. In this paper, we address this challenge by presenting IndicFinNLP – a collection of 9 datasets consisting of three tasks relating to FinNLP for three Indian languages. These tasks are Exaggerated Numeral Detection, Sustainability Classification, and ESG Theme Determination of financial texts in Hindi, Bengali, and Telugu. Moreover, we release the datasets under the CC BY-NC-SA 4.0 license for the benefit of the research community.

2023

Attentive Fusion: A Transformer-based Approach to Multimodal Hate Speech Detection
Atanu Mandal | Gargi Roy | Amit Barman | Indranil Dutta | Sudip Kumar Naskar
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

With the recent surge and exponential growth of social media usage, scrutinizing social media content for the presence of any hateful content is of utmost importance. Researchers have been diligently working for the past decade on distinguishing between content that promotes hatred and content that does not. Traditionally, the main focus has been on analyzing textual content. However, recent research has also begun to address the identification of audio-based content. Nevertheless, studies have shown that relying solely on audio or text-based content may be ineffective, as a recent upsurge indicates that individuals often employ sarcasm in their speech and writing. To overcome these challenges, we present an approach to identify whether a speech promotes hate or not, utilizing both audio and textual representations. Our methodology is based on the Transformer framework that incorporates both audio and text sampling, accompanied by our very own layer called “Attentive Fusion”. The results of our study surpassed previous state-of-the-art techniques, achieving a macro F1 score of 0.927 on the test set.

There is an evident lack of implementation of Machine Learning (ML) in the legal domain in India, and any research that does take place in this domain is usually based on data from the higher courts of law and works with English data. The lower courts and data from the different regional languages of India are often overlooked. In this paper, we deploy a Convolutional Neural Network (CNN) architecture on a corpus of Hindi legal documents. We perform a bail prediction task with the help of a CNN model and achieve an overall accuracy of 93%, which is an improvement on the benchmark accuracy set by Kapoor et al. (2022), albeit on data from 20 districts of the Indian state of Uttar Pradesh.

IACS-LRILT: Machine Translation for Low-Resource Indic Languages
Dhairya Suman | Atanu Mandal | Santanu Pal | Sudip Naskar
Proceedings of the Eighth Conference on Machine Translation

Even though machine translation has seen huge improvements in the last decade, translation quality for Indic languages is still underwhelming, which is attributed to the small amount of parallel data available. In this paper, we present our approach to mitigate the issue of the low amount of parallel training data available for Indic languages, especially for the language pairs English-Manipuri and Assamese-English. Our primary submission for the Manipuri-to-English translation task provided the best scoring system for this language direction. We describe in detail the systems we built and our findings in the process.

A low resource framework for Multi-lingual ESG Impact Type Identification
Harsha Vardhan | Sohom Ghosh | Ponnurangam Kumaraguru | Sudip Naskar
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing

With the growing interest in Green Investing, Environmental, Social, and Governance (ESG) factors related to Institutions and financial entities have become extremely important for investors. While the classification of potential ESG factors is an important issue, identifying whether the factors positively or negatively impact the Institution is also a key aspect to consider while making evaluations for ESG scores. This paper presents our solution to identify ESG impact types in four languages (English, Chinese, Japanese, French) released as shared tasks during the FinNLP workshop at the IJCNLP-AACL-2023 conference. We use a combination of translation, masked language modeling, paraphrasing, and classification to solve this problem and use a generalized pipeline that performs well across all four languages. Our team ranked 1st in the Chinese and Japanese sub-tasks.

2022

LIPI at the FinNLP-2022 ERAI Task: Ensembling Sentence Transformers for Assessing Maximum Possible Profit and Loss from Online Financial Posts
Sohom Ghosh | Sudip Kumar Naskar
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Using insights from social media for making investment decisions has become mainstream. However, in the current era of information explosion, it is essential to mine high-quality social media posts. The FinNLP-2022 ERAI task deals with assessing Maximum Possible Profit (MPP) and Maximum Loss (ML) from social media posts relating to finance. In this paper, we present our team LIPI’s approach. We ensembled a range of Sentence Transformers to quantify these posts. Unlike other teams with varying performances across different metrics, our system performs consistently well. Our code is available here: https://github.com/sohomghosh/LIPI_ERAI_FinNLP_EMNLP-2022/

Ranking Environment, Social And Governance Related Concepts And Assessing Sustainability Aspect of Financial Texts
Sohom Ghosh | Sudip Kumar Naskar
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Understanding Environmental, Social, and Governance (ESG) factors related to financial products has become extremely important for investors. However, manually screening through the corporate policies and reports to understand their sustainability aspect is extremely tedious. In this paper, we propose solutions to two such problems which were released as shared tasks of the FinNLP workshop of the IJCAI-2022 conference. Firstly, we train a Sentence Transformers based model which automatically ranks ESG related concepts for a given unknown term. Secondly, we fine-tune a RoBERTa model to classify financial texts as sustainable or not. Out of 26 registered teams, our team ranked 4th in sub-task 1 and 3rd in sub-task 2. The source code can be accessed from https://github.com/sohomghosh/Finsim4_ESG

FinRAD: Financial Readability Assessment Dataset - 13,000+ Definitions of Financial Terms for Measuring Readability
Sohom Ghosh | Shovon Sengupta | Sudip Naskar | Sunny Kumar Singh
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In today’s world, the advancement and spread of the Internet and digitalization have resulted in most information being openly accessible. This holds true for financial services as well. Investors make data-driven decisions by analysing publicly available information like annual reports of listed companies, details regarding asset allocation of mutual funds, etc. Many a time these financial documents contain unknown financial terms. In such cases, it becomes important to look at their definitions. However, not all definitions are equally readable. Readability largely depends on the structure, complexity and constituent terms that make up a definition. This brings in the need for automatically evaluating the readability of definitions of financial terms. This paper presents a dataset, FinRAD, consisting of financial terms, their definitions and embeddings. In addition to standard readability scores (like “Flesch Reading Index (FRI)”, “Automated Readability Index (ARI)”, “SMOG Index Score (SIS)”, “Dale-Chall formula (DCF)”, etc.), it also contains the readability scores (AR) assigned based on sources from which the terms have been collected. We manually inspect a sample from it to ensure the quality of the assignment. Subsequently, we show that these rule-based standard readability scores do not correlate well with the manually assigned binary readability scores of definitions of financial terms. Finally, we present a few neural baselines using transformer-based architectures to automatically classify these definitions as readable or not. A pre-trained FinBERT model fine-tuned on the FinRAD corpus performs the best (AU-ROC = 0.9927, F1 = 0.9610). This corpus can be downloaded from https://github.com/sohomghosh/FinRAD_Financial_Readability_Assessment_Dataset.

LIPI at FinCausal 2022: Mining Causes and Effects from Financial Texts
Sohom Ghosh | Sudip Naskar
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

While reading financial documents, investors need to know the causes and their effects. This empowers them to make data-driven decisions. Thus, there is a need to develop an automated system for extracting causes and their effects from financial texts using Natural Language Processing. In this paper, we present the approach our team LIPI followed while participating in the FinCausal 2022 shared task. This approach is based on the winning solution of the first edition of FinCausal held in the year 2020.

A Novel Approach towards Cross Lingual Sentiment Analysis using Transliteration and Character Embedding
Rajarshi Roychoudhury | Subhrajit Dey | Md Akhtar | Amitava Das | Sudip Naskar
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Sentiment analysis with deep learning in resource-constrained languages is a challenging task. In this paper, we introduce a novel approach for sentiment analysis in resource-constrained scenarios using character embedding and cross-lingual sentiment analysis with transliteration. We use this method to introduce the novel task of inducing sentiment polarity of words and sentences and aspect term sentiment analysis in the no-resource scenario. We formulate this task by taking a metalingual approach whereby we transliterate data from closely related languages and transform it into a meta language. We also demonstrate the efficacy of using character-level embedding for sentence representation. We experiment with 4 Indian languages – Bengali, Hindi, Tamil, and Telugu – and obtain encouraging results. We also present new state-of-the-art results on the Hindi sentiment analysis dataset, leveraging our metalingual character embeddings.

2021

Fine-tuning BERT to classify COVID19 tweets containing symptoms
Rajarshi Roychoudhury | Sudip Naskar
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

Twitter is a valuable source of patient-generated data that has been used in various population health studies. The first step in many of these studies is to identify and capture Twitter messages (tweets) containing medication mentions. Identifying personal mentions of COVID19 symptoms requires distinguishing personal mentions from other mentions such as symptoms reported by others and references to news articles or other sources. In this article, we describe our submission to Task 6 of the Social Media Mining for Health Applications (SMM4H) Shared Task 2021. This task challenged participants to classify tweets where the target classes are: (1) self-reports, (2) non-personal reports, and (3) literature/news mentions. Our system used handcrafted preprocessing and word embeddings from a BERT encoder model. We achieved an F1 score of 93%.

An Efficient BERT Based Approach to Detect Aggression and Misogyny
Sandip Dutta | Utso Majumder | Sudip Naskar
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Social media is bustling with ever-growing cases of trolling, aggression and hate. A huge amount of social media data is generated each day, which is too much for manual inspection. In this work, we propose an efficient and fast method to detect aggression and misogyny in social media texts. We use data from the Second Workshop on Trolling, Aggression and Cyber Bullying for our task. We employ a BERT based model to augment our data. Next, we employ Tf-Idf and XGBoost for detecting aggression and misogyny. Our model achieves 0.73 and 0.85 Weighted F1 Scores on the 2 prediction tasks, which are comparable to the state of the art. However, the training time, model size and resource requirements of our model are drastically lower compared to the state-of-the-art models, making our model useful for fast inference.

FinRead: A Transfer Learning Based Tool to Assess Readability of Definitions of Financial Terms
Sohom Ghosh | Shovon Sengupta | Sudip Naskar | Sunny Kumar Singh
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Simplified definitions of complex terms help learners to understand any content better. Comprehending readability is critical for the simplification of these contents. In most cases, the standard formula-based readability measures do not hold up well for measuring the complexity of definitions of financial terms. Furthermore, some of them work only for corpora of longer length which have at least 30 sentences. In this paper, we present a tool for evaluating readability of definitions of financial terms. It consists of a LightGBM based classification layer over sentence embeddings (Reimers et al., 2019) of FinBERT (Araci, 2019). It is trained on glossaries of several financial textbooks and definitions of various financial terms which are available on the web. The extensive evaluation shows that it outperforms the standard benchmarks by achieving an AU-ROC score of 0.993 on the validation set.

Sdutta at ComMA@ICON: A CNN-LSTM Model for Hate Detection
Sandip Dutta | Utso Majumder | Sudip Naskar
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

In today’s world, online activity and social media are facing an upsurge of cases of aggression, gender-biased comments and communal hate. In this shared task, we used a CNN-LSTM hybrid method to detect aggressive, misogynistic and communally charged content in social media texts. First, we clean the text and convert it into word embeddings. Next, we apply our CNN-LSTM based model to predict the nature of the text. Our model achieves 0.288, 0.279, 0.294 and 0.335 Overall Micro F1 Scores on the multilingual, Meitei, Bengali and Hindi datasets, respectively, on the 3 prediction labels.

JUNLP@DravidianLangTech-EACL2021: Offensive Language Identification in Dravidian Langauges
Avishek Garain | Atanu Mandal | Sudip Kumar Naskar
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Offensive language identification has been an active area of research in natural language processing. With the emergence of multiple social media platforms, offensive language identification has emerged as a need of the hour. Traditional offensive language identification models fail to deliver acceptable results as social media contents are largely multilingual and code-mixed in nature. This paper tries to resolve this problem by using IndicBERT and BERT architectures to facilitate identification of offensive language for Kannada-English, Malayalam-English, and Tamil-English code-mixed language pairs extracted from social media. The presented approach, when evaluated on the test corpus, provided precision, recall, and F1 score for language pair Kannada-English as 0.62, 0.71, and 0.66, respectively, for language pair Malayalam-English as 0.77, 0.43, and 0.53, respectively, and for Tamil-English as 0.71, 0.74, and 0.72, respectively.

2020

In automatic post-editing (APE) it makes sense to condition post-editing (pe) decisions on both the source (src) and the machine translated text (mt) as input. This has led to multi-encoder based neural APE approaches. A research challenge now is the search for architectures that best support the capture, preparation and provision of src and mt information and its integration with pe decisions. In this paper we present an efficient multi-encoder based APE model, called transference. Unlike previous approaches, it (i) uses a transformer encoder block for src, (ii) followed by a decoder block, but without masking for self-attention on mt, which effectively acts as a second encoder combining src → mt, and (iii) feeds this representation into a final decoder block generating pe. Our model outperforms the best performing systems by 1 BLEU point on the WMT 2016, 2017, and 2018 English–German APE shared tasks (PBSMT and NMT). Furthermore, the results of our model on the WMT 2019 APE task using NMT data show performance comparable to the state-of-the-art system. The inference time of our model is similar to the vanilla transformer-based NMT system although our model deals with two separate encoders. We further investigate the importance of our newly introduced second encoder and find that too few layers hurt the performance, while reducing the number of layers of the decoder does not matter much.

Spyder: Aggression Detection on Multilingual Tweets
Anisha Datta | Shukrity Si | Urbi Chakraborty | Sudip Kumar Naskar
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In the last few years, hate speech and aggressive comments have spread across almost all social media platforms, such as Facebook and Twitter. As a result, hatred is increasing. This paper describes our (Team name: Spyder) participation in the Shared Task on Aggression Detection organised by TRAC-2, Second Workshop on Trolling, Aggression and Cyberbullying. The organizers provided datasets in three languages – English, Hindi and Bengali. The task was to classify each instance of the test sets into three categories – “Overtly Aggressive” (OAG), “Covertly Aggressive” (CAG) and “Non-Aggressive” (NAG). In this paper, we propose three different models using Tf-Idf, sentiment polarity and machine learning based classifiers. We obtained F1 scores of 43.10%, 59.45% and 44.84% for English, Hindi and Bengali, respectively.

A New Approach to Claim Check-Worthiness Prediction and Claim Verification
Shukrity Si | Anisha Datta | Sudip Naskar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

The more we advance towards a modern world, the more paths open to falsification in every aspect of life. Even when they know their surroundings, common people cannot judge the actual scenario, as the promises, comments and opinions of influential people in power keep changing every day. Therefore, computationally determining the truthfulness of such claims and comments has a very important societal impact. This paper describes a unique method to extract check-worthy claims from the 2016 US presidential debates and verify the truthfulness of the check-worthy claims. We classify the claims for check-worthiness with our modified Tf-Idf model, which is used in background training on fact-checking news articles (NBC News and Washington Post). We check the truthfulness of the claims by using POS, sentiment score and cosine similarity features.

A Rule Based Lightweight Bengali Stemmer
Souvick Das | Rajat Pandit | Sudip Kumar Naskar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

In the field of Natural Language Processing (NLP), the process of stemming plays a significant role. A stemmer transforms an inflected word to its root form. A stemmer significantly increases the efficiency of Information Retrieval (IR) systems. It is a very basic yet fundamental text pre-processing task widely used in many NLP tasks. Several important works on stemming have been carried out by researchers in English and other major languages. In this paper, we study and review existing works on stemming in Bengali and other Indian languages. Finally, we propose a rule-based approach that explores Bengali morphology and leverages WordNet to achieve better accuracy. Our algorithm produced stemming accuracy of 98.86% for nouns and 99.75% for verbs.

Deep Neural Model for Manipuri Multiword Named Entity Recognition with Unsupervised Cluster Feature
Jimmy Laishram | Kishorjit Nongmeikapam | Sudip Naskar
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

The recognition of Multi-Word Named Entities (MNEs) is in itself a challenging task when the language is inflectional and agglutinative. Despite breakthroughs in NLP research with deep neural networks and language modelling techniques, the applicability of such techniques/algorithms to Indian languages like Manipuri remains unanswered. In this paper, an attempt to recognize Manipuri MNEs is made using a Long Short Term Memory (LSTM) recurrent neural network model in conjunction with Part Of Speech (POS) embeddings. To further improve the classification accuracy, word cluster information obtained with the K-means clustering approach is added as a feature embedding. The cluster information is generated using Skip-gram based word vectors that contain the semantic and syntactic information of each word. The proposed model does not use extensive language morphological features to elevate its accuracy. Finally, the model’s performance is compared with other machine learning based Manipuri MNE models.

2019

JU_ETCE_17_21 at SemEval-2019 Task 6: Efficient Machine Learning and Neural Network Approaches for Identifying and Categorizing Offensive Language in Tweets
Preeti Mukherjee | Mainak Pal | Somnath Banerjee | Sudip Kumar Naskar
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system submissions as part of our participation (team name: JU_ETCE_17_21) in the SemEval 2019 shared task 6: “OffensEval: Identifying and Categorizing Offensive Language in Social Media”. We participated in all three sub-tasks: i) Sub-task A: offensive language identification, ii) Sub-task B: automatic categorization of offense types, and iii) Sub-task C: offense target identification. We employed machine learning as well as deep learning approaches for the sub-tasks. We employed Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) Long Short-Term Memory (LSTM) with pre-trained word embeddings. We used both word2vec and GloVe pre-trained word embeddings. We obtained the best F1-score using a CNN based model for sub-task A, an LSTM based model for sub-task B and a Logistic Regression based model for sub-task C. Our best submissions achieved 0.7844, 0.5459 and 0.48 F1-scores for sub-task A, sub-task B and sub-task C respectively.

JU-Saarland Submission to the WMT2019 English–Gujarati Translation Shared Task
Riktim Mondal | Shankha Raj Nayek | Aditya Chowdhury | Santanu Pal | Sudip Kumar Naskar | Josef van Genabith
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University in the WMT 2019 news translation shared task for the English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using a recurrent neural network (RNN) based neural machine translation (NMT) system with an attention mechanism. Given that the two languages belong to different language families and there is not enough parallel data for this language pair, building a high quality NMT system for it is a difficult task. We produced synthetic data through back-translation from available monolingual data. We report the translation quality of our English–Gujarati and Gujarati–English NMT systems trained at word, byte-pair and character encoding levels, where the word-level RNN is considered the baseline and used for comparison. Our English–Gujarati system ranked second in the shared task.

Improving CAT Tools in the Translation Workflow: New Approaches and Evaluation
Mihaela Vela | Santanu Pal | Marcos Zampieri | Sudip Naskar | Josef van Genabith
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

ITER: Improving Translation Edit Rate through Optimizable Edit Costs
Joybrata Panja | Sudip Kumar Naskar
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The paper presents our participation in the WMT 2018 Metrics Shared Task. We propose an improved version of Translation Edit/Error Rate (TER). In addition to including the basic edit operations in TER, namely insertion, deletion, substitution and shift, our metric also allows stem matching, optimizable edit costs and better normalization so as to correlate better with human judgement scores. The proposed metric shows much higher correlation with human judgments than TER.

Keep It or Not: Word Level Quality Estimation for Post-Editing
Prasenjit Basu | Santanu Pal | Sudip Kumar Naskar
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

The paper presents our participation in the WMT 2018 shared task on word level quality estimation (QE) of machine translated (MT) text, i.e., to predict whether a word in MT output for a given source context is correctly translated and hence should be retained in the post-edited translation (PE), or not. To perform the QE task, we measure the similarity of the source context of the target MT word with the context for which the word is retained in PE in the training data. This is achieved in two different ways, using a Bag-of-Words (BoW) model and a Document-to-Vector (Doc2Vec) model. In the BoW model, we compute the cosine similarity, while in the Doc2Vec model we consider the Doc2Vec similarity. By applying the Kneedle algorithm on the F1mult vs. similarity score plot, we derive the threshold based on which OK/BAD decisions are taken for the MT words. Experimental results revealed that the Doc2Vec model performs better than the BoW model on the word level QE task.

2017

Neural Automatic Post-Editing Using Prior Alignment and Reranking
Santanu Pal | Sudip Kumar Naskar | Mihaela Vela | Qun Liu | Josef van Genabith
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a second-stage machine translation (MT) system based on a neural machine translation (NMT) approach to automatic post-editing (APE) that improves the translation quality provided by a first-stage MT system. Our APE system (APE_Sym) is an extended version of an attention based NMT model with bilingual symmetry employing bidirectional models, mt–pe and pe–mt. APE translations produced by our system show statistically significant improvements over the first-stage MT, phrase-based APE and the best reported score on the WMT 2016 APE dataset by a previous neural APE system. Re-ranking (APE_Rerank) of the n-best translations from the phrase-based APE and APE_Sym systems provides further substantial improvements over the symmetric neural APE model. Human evaluation confirms that the APE_Rerank generated PE translations improve on the previous best neural APE system at WMT 2016.

Natural Language Programing with Automatic Code Generation towards Solving Addition-Subtraction Word Problems
Sourav Mandal | Sudip Kumar Naskar
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Unsupervised Morpheme Segmentation Through Numerical Weighting and Thresholding
Joy Mahapatra | Sudip Kumar Naskar
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Normalization of Social Media Text using Deep Neural Networks
Ajay Shankar Tiwari | Sudip Kumar Naskar
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

Biomolecular Event Extraction using a Stacked Generalization based Classifier
Amit Majumder | Asif Ekbal | Sudip Kumar Naskar
Proceedings of the 13th International Conference on Natural Language Processing

Statistical Natural Language Generation from Tabular Non-textual Data
Joy Mahapatra | Sudip Kumar Naskar | Sivaji Bandyopadhyay
Proceedings of the 9th International Natural Language Generation conference

CATaLog Online: Porting a Post-editing Tool to the Web
Santanu Pal | Marcos Zampieri | Sudip Kumar Naskar | Tapas Nayak | Mihaela Vela | Josef van Genabith
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents CATaLog online, a new web-based MT and TM post-editing tool. CATaLog online is freeware that can be used through a web browser and requires only a simple registration. The tool features a number of editing and log functions similar to the desktop version of CATaLog, enhanced with several new features that we describe in detail in this paper. CATaLog online is designed to allow users to post-edit both translation memory segments and machine translation output. The tool provides a complete set of log information currently not available in most commercial CAT tools. Log information can be used both for project management purposes and for the study of the translation process and translator’s productivity.

Multi-Engine and Multi-Alignment Based Automatic Post-Editing and its Impact on Translation Productivity
Santanu Pal | Sudip Kumar Naskar | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper we combine two strands of machine translation (MT) research: automatic post-editing (APE) and multi-engine (system combination) MT. APE systems learn a target-language-side second stage MT system from the data produced by human corrected output of a first stage MT system, to improve the output of the first stage MT in what is essentially a sequential MT system combination architecture. At the same time, there is a rich research literature on parallel MT system combination where the same input is fed to multiple engines and the best output is selected or smaller sections of the outputs are combined to obtain improved translation output. In the paper we show that parallel system combination in the APE stage of a sequential MT-APE combination yields substantial translation improvements both measured in terms of automatic evaluation metrics as well as in terms of productivity improvements measured in a post-editing experiment. We also show that system combination on the level of APE alignments yields further improvements. Overall our APE system yields statistically significant improvement of 5.9% relative BLEU over a strong baseline (English–Italian Google MT) and 21.76% productivity increase in a human post-editing experiment with professional translators.

pdf abs
CATaLog Online: A Web-based CAT Tool for Distributed Translation with Data Capture for APE and Translation Process Research
Santanu Pal | Sudip Kumar Naskar | Marcos Zampieri | Tapas Nayak | Josef van Genabith
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

We present a free web-based CAT tool called CATaLog Online which provides a novel and user-friendly online CAT environment for post-editors/translators. The goal is to support distributed translation, reduce post-editing time and effort, improve the post-editing experience and capture data for incremental MT/APE (automatic post-editing) and translation process research. The tool supports individual as well as batch mode file translation and provides translations from three engines – translation memory (TM), MT and APE. TM suggestions are color coded to accelerate the post-editing task. The users can integrate their personal TM/MT outputs. The tool remotely monitors and records post-editing activities generating an extensive range of post-editing logs.

pdf
A Neural Network based Approach to Automatic Post-Editing
Santanu Pal | Sudip Kumar Naskar | Mihaela Vela | Josef van Genabith
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

pdf
UdS-Sant: English–German Hybrid Machine Translation System
Santanu Pal | Sudip Naskar | Josef van Genabith
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf
USAAR-SAPE: An English–Spanish Statistical Automatic Post-Editing System
Santanu Pal | Mihaela Vela | Sudip Kumar Naskar | Josef van Genabith
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

pdf abs
Word Alignment-Based Reordering of Source Chunks in PB-SMT
Santanu Pal | Sudip Kumar Naskar | Sivaji Bandyopadhyay
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Reordering poses a big challenge in statistical machine translation between distant language pairs. The paper presents how reordering between distant language pairs can be handled efficiently in phrase-based statistical machine translation. The problem of reordering between distant languages has been approached with prior reordering of the source text at chunk level to simulate the target language ordering. Prior reordering of the source chunks is performed in the present work by following the target word order suggested by word alignment. The test set is reordered using monolingual MT trained on source and reordered source. This approach of prior reordering of the source chunks was compared with pre-ordering of source words based on word alignments and the traditional approach of prior source reordering based on language-pair specific reordering rules. The effects of these reordering approaches were studied on an English–Bengali translation task, a language pair with different word orders. From the experimental results it was found that word alignment based reordering of the source chunks is more effective than the other reordering approaches, and it produces statistically significant improvements over the baseline system on BLEU. On manual inspection we found significant improvements in terms of word alignments.
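
As a rough illustration of the chunk-level idea in the abstract above, the following minimal sketch reorders source chunks by the average target position of their aligned words. The function name `reorder_chunks`, the averaging heuristic, and the handling of unaligned chunks are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def reorder_chunks(chunks, alignment):
    """Reorder source chunks so they follow the target-side word order.

    chunks:    list of chunks, each a list of source token indices.
    alignment: dict mapping a source token index to a list of aligned
               target token indices (hypothetical input format).

    Assumption: a chunk is placed by the mean target position of its
    aligned words; a chunk with no aligned words stays directly after
    the chunk that preceded it in the source.
    """
    keyed = []
    last_key = -1.0  # key inherited by unaligned chunks
    for pos, chunk in enumerate(chunks):
        targets = [t for s in chunk for t in alignment.get(s, [])]
        key = mean(targets) if targets else last_key
        last_key = key
        keyed.append((key, pos, chunk))
    # Sort by target position; ties keep the original source order.
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [chunk for _, _, chunk in keyed]


# Toy example: three one-word chunks in subject-verb-object order whose
# alignment points to a subject-object-verb target order.
tokens = ["I", "eat", "rice"]
chunks = [[0], [1], [2]]
alignment = {0: [0], 1: [2], 2: [1]}
reordered = reorder_chunks(chunks, alignment)
print(" ".join(tokens[i] for chunk in reordered for i in chunk))
```

In this toy case the verb chunk moves to the end, giving a source sentence in target-like order.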

pdf
Automatic Building and Using Parallel Resources for SMT from Comparable Corpora
Santanu Pal | Partha Pakray | Sudip Kumar Naskar
Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra)

pdf abs
Perception vs. reality: measuring machine translation post-editing productivity
Federico Gaspari | Antonio Toral | Sudip Kumar Naskar | Declan Groves | Andy Way
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

This paper presents a study of user-perceived vs real machine translation (MT) post-editing effort and productivity gains, focusing on two bidirectional language pairs: English–German and English–Dutch. Twenty experienced media professionals post-edited statistical MT output and also manually translated comparative texts within a production environment. The paper compares the actual post-editing time against the users’ perception of the effort and time required to post-edit the MT output to achieve publishable quality, thus measuring real (vs perceived) productivity gains. Although for all the language pairs users perceived MT post-editing to be slower, in fact it proved to be a faster option than manual translation for two translation directions out of four, i.e. for Dutch to English, and (marginally) for English to German. For further objective scrutiny, the paper also checks the correlation of three state-of-the-art automatic MT evaluation metrics (BLEU, METEOR and TER) with the actual post-editing time.

In this paper, we provide a description of the Dublin City University’s (DCU) submissions in the IWSLT 2011 evaluation campaign. We participated in the Arabic-English and Chinese-English Machine Translation (MT) track translation tasks. We use phrase-based statistical machine translation (PBSMT) models to create the baseline system. Due to the open-domain nature of the data to be translated, we use domain adaptation techniques to improve the quality of translation. Furthermore, we explore target-side syntactic augmentation for a Hierarchical Phrase-Based (HPB) SMT model. Combinatory Categorial Grammar (CCG) is used to extract labels for target-side phrases and non-terminals in the HPB system. Combining the domain adapted language models with the CCG-augmented HPB system gave us the best translations for both language pairs, providing statistically significant improvements of 6.09 absolute BLEU points (25.94% relative) and 1.69 absolute BLEU points (15.89% relative) over the unadapted PBSMT baselines for the Arabic-English and Chinese-English language pairs, respectively.

pdf
Domain Adaptation in Statistical Machine Translation of User-Forum Data using Component Level Mixture Modelling
Pratyush Banerjee | Sudip Kumar Naskar | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf
A Framework for Diagnostic Evaluation of MT Based on Linguistic Checkpoints
Sudip Kumar Naskar | Antonio Toral | Federico Gaspari | Andy Way
Proceedings of Machine Translation Summit XIII: Papers

2010

pdf abs
Combining Multi-Domain Statistical Machine Translation Models using Automatic Classifiers
Pratyush Banerjee | Jinhua Du | Baoli Li | Sudip Naskar | Andy Way | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper presents a set of experiments on Domain Adaptation of Statistical Machine Translation systems. The experiments focus on Chinese-English and two domain-specific corpora. The paper presents a novel approach for combining multiple domain-trained translation models to achieve improved translation quality for both domain-specific and combined sets of sentences. We train a statistical classifier to classify sentences according to the appropriate domain and utilize the corresponding domain-specific MT models to translate them. Experimental results show that the method achieves a statistically significant absolute improvement of 1.58 BLEU points (2.86% relative improvement) over a translation model trained on combined data, and considerable improvements over a model using multiple decoding paths of the Moses decoder, for the combined domain test set. Furthermore, even for domain-specific test sets, our approach works almost as well as dedicated domain-specific models and perfect classification.
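
The classify-then-route idea described in the abstract above can be sketched roughly as follows. This is only a toy under stated assumptions: the word-count scorer stands in for the statistical classifier, the engines are stub functions rather than real MT models, and all names (`train_domain_classifier`, `classify`, `translate`) are hypothetical.

```python
from collections import Counter


def train_domain_classifier(labelled):
    """Collect per-domain word counts from (domain, sentence) pairs.

    A simple word-count model used here as a stand-in for a real
    statistical classifier (assumption for illustration).
    """
    counts = {}
    for domain, sentence in labelled:
        counts.setdefault(domain, Counter()).update(sentence.lower().split())
    return counts


def classify(sentence, counts):
    """Pick the domain whose training data shares the most word occurrences."""
    words = sentence.lower().split()
    # sorted() makes tie-breaking deterministic (first domain alphabetically).
    return max(sorted(counts), key=lambda d: sum(counts[d][w] for w in words))


def translate(sentence, counts, engines):
    """Route the sentence to the engine of its predicted domain."""
    domain = classify(sentence, counts)
    return domain, engines[domain](sentence)


# Toy training data and stub engines (hypothetical, not from the paper).
labelled = [
    ("news", "the minister announced a budget"),
    ("tech", "install the driver and reboot"),
]
engines = {
    "news": lambda s: "[news-MT] " + s,
    "tech": lambda s: "[tech-MT] " + s,
}
counts = train_domain_classifier(labelled)
print(translate("reboot after the driver update", counts, engines))
```

Here the input shares more words with the "tech" training sentence, so it is sent to the "tech" stub engine.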

pdf abs
Supertags as Source Language Context in Hierarchical Phrase-Based SMT
Rejwanul Haque | Sudip Naskar | Antal van den Bosch | Andy Way
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

Statistical machine translation (SMT) models have recently begun to include source context modeling, under the assumption that the proper lexical choice of the translation for an ambiguous word can be determined from the context in which it appears. Various types of lexical and syntactic features have been explored as effective source context to improve phrase selection in SMT. In the present work, we introduce lexico-syntactic descriptions in the form of supertags as source-side context features in the state-of-the-art hierarchical phrase-based SMT (HPB) model. These features enable us to exploit source similarity in addition to target similarity, as modelled by the language model. In our experiments two kinds of supertags are employed: those from lexicalized tree-adjoining grammar (LTAG) and combinatory categorial grammar (CCG). We use a memory-based classification framework that enables the efficient estimation of these features. Despite the differences between the two supertagging approaches, they give similar improvements. We evaluate the performance of our approach on an English-to-Dutch translation task, and report statistically significant improvements of 4.48% and 6.3% BLEU scores in translation quality when adding CCG and LTAG supertags, respectively, as context-informed features.

pdf
Mitigating Problems in Analogy-based EBMT with SMT and vice versa: A Case Study with Named Entity Transliteration
Sandipan Dandapat | Sara Morrissey | Sudip Kumar Naskar | Harold Somers
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

pdf
Handling Named Entities and Compound Verbs in Phrase-Based Statistical Machine Translation
Santanu Pal | Sudip Kumar Naskar | Pavel Pecina | Sivaji Bandyopadhyay | Andy Way
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

2009

pdf
Using Supertags as Source Language Context in SMT
Rejwanul Haque | Sudip Kumar Naskar | Yanjun Ma | Andy Way
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

pdf
Dependency Relations as Source Context in Phrase-Based SMT
Rejwanul Haque | Sudip Kumar Naskar | Antal van den Bosch | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1

pdf
Experiments on Domain Adaptation for English–Hindi SMT
Rejwanul Haque | Sudip Kumar Naskar | Josef van Genabith | Andy Way
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf
English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

2008

pdf
Bengali, Hindi and Telugu to English Ad-hoc Bilingual Task
Sivaji Bandyopadhyay | Tapabrata Mondal | Sudip Kumar Naskar | Asif Ekbal | Rejwanul Haque | Srinivasa Rao Godavarthy
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies

2007

pdf
JU-SKNSB: Extended WordNet Based WSD on the English All-Words Task at SemEval-1
Sudip Kumar Naskar | Sivaji Bandyopadhyay
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf
A Modified Joint Source-Channel Model for Transliteration
Asif Ekbal | Sudip Kumar Naskar | Sivaji Bandyopadhyay
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
Handling of Prepositions in English to Bengali Machine Translation
Sudip Kumar Naskar | Sivaji Bandyopadhyay
Proceedings of the Third ACL-SIGSEM Workshop on Prepositions

2005

pdf abs
A Phrasal EBMT System for Translating English to Bengali
Sudip Kumar Naskar | Sivaji Bandyopadhyay
Proceedings of Machine Translation Summit X: Posters

The present work describes a Phrasal Example Based Machine Translation system from English to Bengali that identifies the phrases in the input through a shallow analysis, retrieves the target phrases using a Phrasal Example base and finally combines the target language phrases employing some heuristics based on the phrase ordering rules for Bengali. The paper focuses on the structure of the noun, verb and prepositional phrases in English and how these phrases are realized in Bengali. This study has an effect on the design of the phrasal Example Base and recombination rules for the target language phrases.

pdf abs
Use of Machine Translation in India: Current Status
Sudip Naskar | Sivaji Bandyopadhyay
Proceedings of Machine Translation Summit X: Posters

A survey of the machine translation systems that have been developed in India for translation from English to Indian languages and among Indian languages reveals that the MT software is used in field testing or is available as a web translation service. These systems are also used for teaching machine translation to students and researchers. Most of these systems are in the English-Hindi or Indian language-Indian language domain. The translation domains are mostly government documents/reports and news stories. There are a number of other MT systems that are at various phases of development and have been demonstrated at various forums. Many of these systems cover other Indian languages besides Hindi.