Sourav Kumar


2021

How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages
Sourav Kumar | Salil Aggarwal | Dipti Misra Sharma | Radhika Mamidi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time. Nowadays, researchers are constantly working to exploit language relatedness to improve the performance of various NLP systems such as cross-lingual semantic search, machine translation, sentiment analysis systems, etc. So in this paper, we performed an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on the approach to calculate lexical similarity between Indian languages by looking at various factors such as size and type of corpus, similarity algorithms, subword segmentation, etc. The main takeaways from our work are: (i) The relative order of the language similarities largely remains the same, regardless of the factors mentioned above, (ii) Similarity within the same language family is higher, (iii) Languages share more lexical features at the subword level.
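As an illustration of takeaway (iii), here is a minimal sketch of one way lexical similarity can be compared at the word and subword levels. The abstract does not name the similarity algorithms or the segmentation scheme used, so Jaccard overlap and character trigrams are assumptions chosen for simplicity, and the romanized word lists are made-up toy data, not drawn from the paper's corpora.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (the whole word if it is short)."""
    if len(word) <= n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def lexical_similarity(tokens_a, tokens_b, n=None):
    """Vocabulary overlap: word level when n is None, character n-gram level otherwise."""
    if n is None:
        units_a, units_b = set(tokens_a), set(tokens_b)
    else:
        units_a = set().union(*(char_ngrams(t, n) for t in tokens_a))
        units_b = set().union(*(char_ngrams(t, n) for t in tokens_b))
    return jaccard(units_a, units_b)


# Hypothetical toy vocabularies (romanized cognates with spelling differences).
vocab_a = ["pani", "ghar", "kitab"]
vocab_b = ["paani", "ghar", "kitaab"]

word_level = lexical_similarity(vocab_a, vocab_b)          # only "ghar" matches exactly
subword_level = lexical_similarity(vocab_a, vocab_b, n=3)  # shared trigrams also count
print(word_level, subword_level)
```

On this toy data the trigram-level overlap is higher than the word-level overlap, because cognates that differ in spelling still share most of their character sequences.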

Efficient Multilingual Text Classification for Indian Languages
Salil Aggarwal | Sourav Kumar | Radhika Mamidi
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

India is one of the richest language hubs on the earth and is very diverse and multilingual. But apart from a few Indian languages, most of them are still considered to be resource poor. Since most NLP techniques require either linguistic knowledge that can only be developed by experts and native speakers of that language or a lot of labelled data, which is again expensive to generate, the task of text classification becomes challenging for most of the Indian languages. The main objective of this paper is to see how one can benefit from the lexical similarity found in Indian languages in a multilingual scenario. Can a classification model trained on one Indian language be reused for other Indian languages? So, we performed zero-shot text classification by exploiting lexical similarity, and we observed that our model performs best in those cases where the vocabulary overlap between the language datasets is maximum. Our experiments also confirm that a single multilingual model trained by exploiting language relatedness outperforms the baselines by significant margins.

Multilingual Multi-Domain NMT for Indian Languages
Sourav Kumar | Salil Aggarwal | Dipti Sharma
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

India is known as the land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach for machine translation (MT) but performs better only with large datasets which Indian languages usually lack, making this approach infeasible. So, in this paper, we address the problem of data scarcity by efficiently training multilingual and multilingual multi domain NMT systems involving languages of the Indian subcontinent. We are proposing the technique for using the joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) Training a multilingual system via exploiting lexical similarity based on language family helps in achieving an overall average improvement of 3.25 BLEU points over bilingual baselines, (ii) Technique of incorporating domain information into the language tokens helps multilingual multi-domain system in getting a significant average improvement of 6 BLEU points over the baselines, (iii) Multistage fine-tuning further helps in getting an improvement of 1-1.5 BLEU points for the language pair of interest.
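The idea of folding domain information into the language token can be sketched as a preprocessing step on the source side. The abstract does not give the actual tag format, so the `<2lang_domain>` convention, the function name `tag_source`, and the example sentences below are hypothetical.

```python
def tag_source(sentence, target_lang, domain=None):
    """Prepend a target-language tag, optionally fused with a domain label.

    The tag layout ("<2hi>" or "<2hi_health>") is an assumed convention for
    illustration only; the paper's exact token format is not stated here.
    """
    if domain is None:
        tag = f"<2{target_lang}>"
    else:
        tag = f"<2{target_lang}_{domain}>"
    return f"{tag} {sentence.strip()}"


# Language-only tag, as in a plain multilingual system.
print(tag_source("How are you?", "hi"))
# Joint language and domain tag, as in a multilingual multi-domain system.
print(tag_source("Take the tablet after meals.", "ta", domain="health"))
```

With a single fused token, the model sees one symbol per (language, domain) pair, so the same preprocessing serves both the multilingual and the multilingual multi-domain setups.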

IIIT Hyderabad Submission To WAT 2021: Efficient Multilingual NMT systems for Indian languages
Sourav Kumar | Salil Aggarwal | Dipti Sharma
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper describes the work and the systems submitted by the IIIT-Hyderabad team in the WAT 2021 MultiIndicMT shared task. The task covers 10 major languages of the Indian subcontinent. For the scope of this task, we have built multilingual systems for 20 translation directions namely English-Indic (one-to-many) and Indic-English (many-to-one). Individually, Indian languages are resource poor which hampers translation quality but by leveraging multilingualism and abundant monolingual corpora, the translation quality can be substantially boosted. But the multilingual systems are highly complex in terms of time as well as computational resources. Therefore, we are training our systems by efficiently selecting data that will actually contribute to most of the learning process. Furthermore, we are also exploiting the language relatedness found in between Indian languages. All the comparisons were made using BLEU score and we found that our final multilingual system significantly outperforms the baselines by an average of 11.3 and 19.6 BLEU points for English-Indic (en-xx) and Indic-English (xx-en) directions, respectively.

2020

Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
Vikrant Goyal | Sourav Kumar | Dipti Misra Sharma
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

A large percentage of the world's population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (e.g. Hindi, Punjabi, Gujarati, etc.) and Dravidian (e.g. Tamil, Telugu, Malayalam, etc.) families. A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English via efficiently exploiting parallel data from the related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for low resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines.
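To give a sense of what mapping related languages into a common script can look like, here is a minimal sketch that shifts Gurmukhi and Gujarati text into the Devanagari Unicode block by code-point offset. This is not the paper's Unified Transliteration procedure, whose details the abstract does not give. It relies on the fact that these Unicode blocks are laid out in parallel; not every code point has a counterpart in the target block, so a real system would need exception handling that is omitted here. The function and table names are hypothetical.

```python
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each of these Indic script blocks spans 128 code points

# Start of the Unicode block for each supported source script.
SCRIPT_START = {
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
}


def to_devanagari(text, script):
    """Shift characters of the given script into the Devanagari block.

    Characters outside the source script's block (spaces, digits, Latin
    letters, punctuation) are passed through unchanged.
    """
    start = SCRIPT_START[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + BLOCK_SIZE:
            out.append(chr(cp - start + DEVANAGARI_START))
        else:
            out.append(ch)
    return "".join(out)


# Gujarati KA (U+0A95) and Gurmukhi KA (U+0A15) both map to Devanagari KA (U+0915).
print(to_devanagari("\u0A95", "gujarati") == "\u0915")
print(to_devanagari("\u0A15", "gurmukhi") == "\u0915")
```

Once related languages share one script, a single subword vocabulary learned over the combined data can capture their common character sequences.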