Dipti Misra Sharma

Also published as: Dipti M Sharma, Dipti M. Sharma, Dipti Misra, Dipti Misra Sharma, Dipti Sharma

2024

pdf abs
Assessing Translation Capabilities of Large Language Models involving English and Indian Languages
Vandan Mujadia | Ashok Urlana | Yash Bhaskar | Penumalla Aditya Pavani | Kukkapalli Shravya | Parameswari Krishnamurthy | Dipti Sharma
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

Generative Large Language Models (LLMs) have achieved remarkable advances in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large-language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the model that performs best among the large language models available for the translation task.Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using two-stage fine-tuned LLaMA-13b for English to Indian languages on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35 and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest and newstest2019 testsets. Overall, our findings highlight the potential and strength of large language models for machine translation capabilities, including languages that are currently underrepresented in LLMs.

pdf abs
Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Sankalp Bahad | Pruthwik Mishra | Parameswari Krishnamurthy | Dipti Sharma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Named Entity Recognition (NER) is a use-ful component in Natural Language Process-ing (NLP) applications. It is used in varioustasks such as Machine Translation, Summa-rization, Information Retrieval, and Question-Answering systems. The research on NER iscentered around English and some other ma-jor languages, whereas limited attention hasbeen given to Indian languages. We analyze thechallenges and propose techniques that can betailored for Multilingual Named Entity Recog-nition for Indian Languages. We present a hu-man annotated named entity corpora of ∼40Ksentences for 4 Indian languages from two ofthe major Indian language families. Addition-ally, we show the transfer learning capabilitiesof pre-trained transformer models from a highresource language to multiple low resource lan-guages through a series of experiments. Wealso present a multilingual model fine-tunedon our dataset, which achieves an F1 score of∼0.80 on our dataset on average. We achievecomparable performance on completely unseenbenchmark datasets for Indian languages whichaffirms the usability of our model.

pdf abs
LTRC-IIITH at EHRSQL 2024: Enhancing Reliability of Text-to-SQL Systems through Abstention and Confidence Thresholding
Jerrin Thomas | Pruthwik Mishra | Dipti Sharma | Parameswari Krishnamurthy
Proceedings of the 6th Clinical Natural Language Processing Workshop

In this paper, we present our work in the EHRSQL 2024 shared task which tackles reliable text-to-SQL modeling on Electronic Health Records. Our proposed system tackles the task with three modules - abstention module, text-to-SQL generation module, and reliability module. The abstention module identifies whether the question is answerable given the database schema. If the question is answerable, the text-to-SQL generation module generates the SQL query and associated confidence score. The reliability module has two key components - confidence score thresholding, which rejects generations with confidence below a pre-defined level, and error filtering, which identifies and excludes SQL queries that result in execution errors. In the official leaderboard for the task, our system ranks 6th. We have also made the source code public.

pdf bib abs
Towards Disfluency Annotated Corpora for Indian Languages
Chayan Kochar | Vandan Vasantlal Mujadia | Pruthwik Mishra | Dipti Misra Sharma
Proceedings of the 7th Workshop on Indian Language Data: Resources and Evaluation

In the natural course of spoken language, individuals often engage in thinking and self-correction during speech production. These instances of interruption or correction are commonly referred to as disfluencies. When preparing data for subsequent downstream NLP tasks, these linguistic elements can be systematically removed, or handled as required, to enhance data quality. In this study, we present a comprehensive research on disfluencies in Indian languages. Our approach involves not only annotating real-world conversation transcripts but also conducting a detailed analysis of linguistic nuances inherent to Indian languages that are necessary to consider during annotation. Additionally, we introduce a robust algorithm for the synthetic generation of disfluent data. This algorithm aims to facilitate more effective model training for the identification of disfluencies in real-world conversations, thereby contributing to the advancement of disfluency research in Indian languages.

2023

pdf abs
Towards Speech to Speech Machine Translation focusing on Indian Languages
Vandan Mujadia | Dipti Sharma
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We introduce an SSMT (Speech to Speech Machine Translation, aka Speech to Speech Video Translation) Pipeline(https://ssmt.iiit.ac.in/ssmtiiith), as web application for translating videos from one language to another by cascading multiple language modules. Our speech translation system combines highly accurate speech to text (ASR) for Indian English, pre-possessing modules to bridge ASR-MT gaps such as spoken disfluency and punctuation, robust machine translation (MT) systems for multiple language pairs, SRT module for translated text, text to speech (TTS) module and a module to render translated synthesized audio on the original video. It is user-friendly, flexible, and easily accessible system. We aim to provide a complete configurable speech translation experience to users and researchers with this system. It also supports human intervention where users can edit outputs of different modules and the edited output can then be used for subsequent processing to improve overall output quality. By adopting a human-in-the-loop approach, the aim is to configure technology in such a way where it can assist humans and help to reduce the involved human efforts in speech translation involving English and Indian languages. As per our understanding, this is the first fully integrated system for English to Indian languages (Hindi, Telugu, Gujarati, Marathi and Punjabi) video translation. Our evaluation shows that one can get 3.5+ MOS score using the developed pipeline with human intervention for English to Hindi. A short video demonstrating our system is available at https://youtu.be/MVftzoeRg48.

2022

pdf abs
The LTRC Hindi-Telugu Parallel Corpus
Vandan Mujadia | Dipti Sharma
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the Hindi-Telugu Parallel Corpus of different technical domains such as Natural Science, Computer Science, Law and Healthcare along with the General domain. The qualitative corpus consists of 700K parallel sentences of which 535K sentences were created using multiple methods such as extract, align and review of Hindi-Telugu corpora, end-to-end human translation, iterative back-translation driven post-editing and around 165K parallel sentences were collected from available sources in the public domain. We present the comparative assessment of created parallel corpora for representativeness and diversity. The corpus has been pre-processed for machine translation, and we trained a neural machine translation system using it and report state-of-the-art baseline results on the developed development set over multiple domains and on available benchmarks. With this, we define a new task on Domain Machine Translation for low resource language pairs such as Hindi and Telugu. The developed corpus (535K) is freely available for non-commercial research and to the best of our knowledge, this is the well curated, largest, publicly available domain parallel corpus for Hindi-Telugu.

pdf abs
HAWP: a Dataset for Hindi Arithmetic Word Problem Solving
Harshita Sharma | Pruthwik Mishra | Dipti Sharma
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Word Problem Solving remains a challenging and interesting task in NLP. A lot of research has been carried out to solve different genres of word problems with various complexity levels in recent years. However, most of the publicly available datasets and work has been carried out for English. Recently there has been a surge in this area of word problem solving in Chinese with the creation of large benchmark datastes. Apart from these two languages, labeled benchmark datasets for low resource languages are very scarce. This is the first attempt to address this issue for any Indian Language, especially Hindi. In this paper, we present HAWP (Hindi Arithmetic Word Problems), a dataset consisting of 2336 arithmetic word problems in Hindi. We also developed baseline systems for solving these word problems. We also propose a new evaluation technique for word problem solvers taking equation equivalence into account.

Code-mixed machine translation has become an important task in multilingual communities and extending the task of machine translation to code mixed data has become a common task for these languages. In the shared tasks of EMNLP 2022, we try to tackle the same for both English + Hindi to Hinglish and Hinglish to English. The first task dealt with both Roman and Devanagari script as we had monolingual data in both English and Hindi whereas the second task only had data in Roman script. To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation. In this paper, we discuss the use of mBART with some special pre-processing and post-processing (transliteration from Devanagari to Roman) for the first task in detail and the experiments that we performed for the second task of translating code-mixed Hinglish to monolingual English.

2021

pdf abs
Assessing Post-editing Effort in the English-Hindi Direction
Arafat Ahsan | Vandan Mujadia | Dipti Misra Sharma
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

We present findings from a first in-depth post-editing effort estimation study in the English-Hindi direction along multiple effort indicators. We conduct a controlled experiment involving professional translators, who complete assigned tasks alternately, in a translation from scratch and a post-edit condition. We find that post-editing reduces translation time (by 63%), utilizes fewer keystrokes (by 59%), and decreases the number of pauses (by 63%) when compared to translating from scratch. We further verify the quality of translations thus produced via a human evaluation task in which we do not detect any discernible quality differences.

pdf abs
Stress Rules from Surface Forms: Experiments with Program Synthesis
Saujas Vaduguru | Partho Sarthi | Monojit Choudhury | Dipti Sharma
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Learning linguistic generalizations from only a few examples is a challenging task. Recent work has shown that program synthesis – a method to learn rules from data in the form of programs in a domain-specific language – can be used to learn phonological rules in highly data-constrained settings. In this paper, we use the problem of phonological stress placement as a case to study how the design of the domain-specific language influences the generalization ability when using the same learning algorithm. We find that encoding the distinction between consonants and vowels results in much better performance, and providing syllable-level information further improves generalization. Program synthesis, thus, provides a way to investigate how access to explicit linguistic information influences what can be learnt from a small number of examples.

pdf abs
Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems
Saujas Vaduguru | Aalok Sathe | Monojit Choudhury | Dipti Sharma
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.

pdf abs
How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages
Sourav Kumar | Salil Aggarwal | Dipti Misra Sharma | Radhika Mamidi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time. Nowadays, researchers are constantly putting efforts in utilizing the language relatedness to improve the performance of various NLP systems such as cross lingual semantic search, machine translation, sentiment analysis systems, etc. So in this paper, we performed an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar the two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on the approach to calculate lexical similarity between Indian languages by looking at various factors such as size and type of corpus, similarity algorithms, subword segmentation, etc. The main takeaways from our work are: (i) Relative order of the language similarities largely remain the same, regardless of the factors mentioned above, (ii) Similarity within the same language family is higher, (iii) Languages share more lexical features at the subword level.

pdf abs
English-Marathi Neural Machine Translation for LoResMT 2021
Vandan Mujadia | Dipti Misra Sharma
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

In this paper, we (team - oneNLP-IIITH) describe our Neural Machine Translation approaches for English-Marathi (both direction) for LoResMT-20211 . We experimented with transformer based Neural Machine Translation and explored the use of different linguistic features like POS and Morph on subword unit for both English-Marathi and Marathi-English. In addition, we have also explored forward and backward translation using web-crawled monolingual data. We obtained 22.2 (overall 2 nd) and 31.3 (overall 1 st) BLEU scores for English-Marathi and Marathi-English on respectively

pdf abs
Domain Adaptation for Hindi-Telugu Machine Translation Using Domain Specific Back Translation
Hema Ala | Vandan Mujadia | Dipti Sharma
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we present a novel approachfor domain adaptation in Neural MachineTranslation which aims to improve thetranslation quality over a new domain. Adapting new domains is a highly challeng-ing task for Neural Machine Translation onlimited data, it becomes even more diffi-cult for technical domains such as Chem-istry and Artificial Intelligence due to spe-cific terminology, etc. We propose DomainSpecific Back Translation method whichuses available monolingual data and gen-erates synthetic data in a different way. This approach uses Out Of Domain words. The approach is very generic and can beapplied to any language pair for any domain. We conduct our experiments onChemistry and Artificial Intelligence do-mains for Hindi and Telugu in both direc-tions. It has been observed that the usageof synthetic data created by the proposedalgorithm improves the BLEU scores significantly.

pdf abs
Multilingual Multi-Domain NMT for Indian Languages
Sourav Kumar | Salil Aggarwal | Dipti Sharma
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

India is known as the land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach for machine translation (MT) but performs better only with large datasets which Indian languages usually lack, making this approach infeasible. So, in this paper, we address the problem of data scarcity by efficiently training multilingual and multilingual multi domain NMT systems involving languages of the 𝐈𝐧𝐝𝐢𝐚𝐧 𝐬𝐮𝐛𝐜𝐨𝐧𝐭𝐢𝐧𝐞𝐧𝐭. We are proposing the technique for using the joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) Training a multilingual system via exploiting lexical similarity based on language family helps in achieving an overall average improvement of 𝟑.𝟐𝟓 𝐁𝐋𝐄𝐔 𝐩𝐨𝐢𝐧𝐭𝐬 over bilingual baselines, (ii) Technique of incorporating domain information into the language tokens helps multilingual multi-domain system in getting a significant average improvement of 𝟔 𝐁𝐋𝐄𝐔 𝐩𝐨𝐢𝐧𝐭𝐬 over the baselines, (iii) Multistage fine-tuning further helps in getting an improvement of 𝟏-𝟏.𝟓 𝐁𝐋𝐄𝐔 𝐩𝐨𝐢𝐧𝐭𝐬 for the language pair of interest.

pdf abs
IIIT Hyderabad Submission To WAT 2021: Efficient Multilingual NMT systems for Indian languages
Sourav Kumar | Salil Aggarwal | Dipti Sharma
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper describes the work and the systems submitted by the IIIT-Hyderbad team in the WAT 2021 MultiIndicMT shared task. The task covers 10 major languages of the Indian subcontinent. For the scope of this task, we have built multilingual systems for 20 translation directions namely English-Indic (one-to- many) and Indic-English (many-to-one). Individually, Indian languages are resource poor which hampers translation quality but by leveraging multilingualism and abundant monolingual corpora, the translation quality can be substantially boosted. But the multilingual systems are highly complex in terms of time as well as computational resources. Therefore, we are training our systems by efficiently se- lecting data that will actually contribute to most of the learning process. Furthermore, we are also exploiting the language related- ness found in between Indian languages. All the comparisons were made using BLEU score and we found that our final multilingual sys- tem significantly outperforms the baselines by an average of 11.3 and 19.6 BLEU points for English-Indic (en-xx) and Indic-English (xx- en) directions, respectively.

pdf abs
Low Resource Similar Language Neural Machine Translation for Tamil-Telugu
Vandan Mujadia | Dipti Sharma
Proceedings of the Sixth Conference on Machine Translation

This paper describes the participation of team oneNLP (LTRC, IIIT-Hyderabad) for the WMT 2021 task, similar language translation. We experimented with transformer based Neural Machine Translation and explored the use of language similarity for Tamil-Telugu and Telugu-Tamil. We incorporated use of different subword configurations, script conversion and single model training for both directions as exploratory experiments.

pdf bib abs
A Transformer Based Approach towards Identification of Discourse Unit Segments and Connectives
Sahil Bakshi | Dipti Sharma
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)

Discourse parsing, which involves understanding the structure, information flow, and modeling the coherence of a given text, is an important task in natural language processing. It forms the basis of several natural language processing tasks such as question-answering, text summarization, and sentiment analysis. Discourse unit segmentation is one of the fundamental tasks in discourse parsing and refers to identifying the elementary units of text that combine to form a coherent text. In this paper, we present a transformer based approach towards the automated identification of discourse unit segments and connectives. Early approaches towards segmentation relied on rule-based systems using POS tags and other syntactic information to identify discourse segments. Recently, transformer based neural systems have shown promising results in this domain. Our system, SegFormers, employs this transformer based approach to perform multilingual discourse segmentation and connective identification across 16 datasets encompassing 11 languages and 3 different annotation frameworks. We evaluate the system based on F1 scores for both tasks, with the best system reporting the highest F1 score of 97.02% for the treebanked English RST-DT dataset.

2020

pdf abs
A Fully Expanded Dependency Treebank for Telugu
Sneha Nallani | Manish Shrivastava | Dipti Sharma
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

Treebanks are an essential resource for syntactic parsing. The available Paninian dependency treebank(s) for Telugu is annotated only with inter-chunk dependency relations and not all words of a sentence are part of the parse tree. In this paper, we automatically annotate the intra-chunk dependencies in the treebank using a Shift-Reduce parser based on Context Free Grammar rules for Telugu chunks. We also propose a few additional intra-chunk dependency relations for Telugu apart from the ones used in Hindi treebank. Annotating intra-chunk dependencies finally provides a complete parse tree for every sentence in the treebank. Having a fully expanded treebank is crucial for developing end to end parsers which produce complete trees. We present a fully expanded dependency treebank for Telugu consisting of 3220 sentences. In this paper, we also convert the treebank annotated with Anncorra part-of-speech tagset to the latest BIS tagset. The BIS tagset is a hierarchical tagset adopted as a unified part-of-speech standard across all Indian Languages. The final treebank is made publicly available.

abs
Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features
Aamir Farhan | Mashrukh Islam | Dipti Misra Sharma
Proceedings of the Fourth Widening Natural Language Processing Workshop

Word segmentation is a fundamental task for most of the NLP applications. Urdu adopts Nastalique writing style which does not have a concept of space. Furthermore, the inherent non-joining attributes of certain characters in Urdu create spaces within a word while writing in digital format. Thus, Urdu not only has space omission but also space insertion issues which make the word segmentation task challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using the Conditional Random Field sequence modeler, our model achieves F 1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification tasks. The results demonstrated in this paper outperform the state-of-the-art methods.

pdf abs
NMT based Similar Language Translation for Hindi - Marathi
Vandan Mujadia | Dipti Sharma
Proceedings of the Fifth Conference on Machine Translation

This paper describes the participation of team F1toF6 (LTRC, IIIT-Hyderabad) for the WMT 2020 task, similar language translation. We experimented with attention based recurrent neural network architecture (seq2seq) for this task. We explored the use of different linguistic features like POS and Morph along with back translation for Hindi-Marathi and Marathi-Hindi machine translation.

pdf abs
Linguistically Informed Hindi-English Neural Machine Translation
Vikrant Goyal | Pruthwik Mishra | Dipti Misra Sharma
Proceedings of the Twelfth Language Resources and Evaluation Conference

Hindi-English Machine Translation is a challenging problem, owing to multiple factors including the morphological complexity and relatively free word order of Hindi, in addition to the lack of sufficient parallel training data. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. To overcome the data sparsity issue caused by the lack of large parallel corpora for Hindi-English, we propose a method to employ additional linguistic knowledge which is encoded by different phenomena depicted by Hindi. We generalize the embedding layer of the state-of-the-art Transformer model to incorporate linguistic features like POS tag, lemma and morph features to improve the translation performance. We compare the results obtained on incorporating this knowledge with the baseline systems and demonstrate significant performance improvements. Although, the Transformer NMT models have a strong efficacy to learn language constructs, we show that the usage of specific features further help in improving the translation performance.

pdf abs
A Simple and Effective Dependency Parser for Telugu
Sneha Nallani | Manish Shrivastava | Dipti Sharma
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

We present a simple and effective dependency parser for Telugu, a morphologically rich, free word order language. We propose to replace the rich linguistic feature templates used in the past approaches with a minimal feature function using contextual vector representations. We train a BERT model on the Telugu Wikipedia data and use vector representations from this model to train the parser. Each sentence token is associated with a vector representing the token in the context of that sentence and the feature vectors are constructed by concatenating two token representations from the stack and one from the buffer. We put the feature representations through a feedforward network and train with a greedy transition based approach. The resulting parser has a very simple architecture with minimal feature engineering and achieves state-of-the-art results for Telugu.

pdf abs
Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
Vikrant Goyal | Sourav Kumar | Dipti Misra Sharma
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

A large percentage of the world’s population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (e.g. Hindi, Punjabi, Gujarati, etc.) and Dravidian (e.g. Tamil, Telugu, Malayalam, etc.) families. A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English via efficiently exploiting parallel data from the related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for low resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines.

pdf abs
Checkpoint Reranking: An Approach to Select Better Hypothesis for Neural Machine Translation Systems
Vinay Pandramish | Dipti Misra Sharma
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In this paper, we propose a method of re-ranking the outputs of Neural Machine Translation (NMT) systems. After the decoding process, we select a few last iteration outputs in the training process as the N-best list. After training a Neural Machine Translation (NMT) baseline system, it has been observed that these iteration outputs have an oracle score higher than baseline up to 1.01 BLEU points compared to the last iteration of the trained system.We come up with a ranking mechanism by solely focusing on the decoder’s ability to generate distinct tokens and without the usage of any language model or data. With this method, we achieved a translation improvement up to +0.16 BLEU points over baseline.We also evaluate our approach by applying the coverage penalty to the training process.In cases of moderate coverage penalty, the oracle scores are higher than the final iteration up to +0.99 BLEU points, and our algorithm gives an improvement up to +0.17 BLEU points.With excessive penalty, there is a decrease in translation quality compared to the baseline system. Still, an increase in oracle scores up to +1.30 is observed with the re-ranking algorithm giving an improvement up to +0.15 BLEU points is found in case of excessive penalty.The proposed re-ranking method is a generic one and can be extended to other language pairs as well.

pdf bib
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Pushpak Bhattacharyya | Dipti Misra Sharma | Rajeev Sangal
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

pdf abs
Polarization and its Life on Social Media: A Case Study on Sabarimala and Demonetisation
Ashutosh Ranjan | Dipti Sharma | Radhika Krishnan
Proceedings of the 17th International Conference on Natural Language Processing (ICON)

This paper is an attempt to study polarisation on social media data. We focus on two hugely controversial and talked about events in the Indian diaspora, namely 1) the Sabarimala Temple (located in Kerala, India) incident which became a nationwide controversy when two women under the age of 50 secretly entered the temple breaking a long standing temple rule that disallowed women of menstruating age (10-50) to enter the temple and 2) the Indian government’s move to demonetise all existing 500 and 1000 denomination banknotes, comprising of 86% of the currency in circulation, in November 2016. We gather tweets around these two events in various time periods, preprocess and annotate them with their sentiment polarity and emotional category, and analyse trends to help us understand changing polarity over time around controversial events. The tweets collected are in English, Hindi and code-mixed Hindi-English. Apart from the analysis on the annotated data, we also present the twitter data comprising a total of around 1.5 million tweets.

pdf abs
Automatic Technical Domain Identification
Hema Ala | Dipti Sharma
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

In this paper we present two Machine Learning algorithms namely Stochastic Gradient Descent and Multi Layer Perceptron to Identify the technical domain of given text as such text provides information about the specific domain. We performed our experiments on Coarse-grained technical domains like Computer Science, Physics, Law, etc for English, Bengali, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu languages, and on fine-grained sub domains for Computer Science like Operating System, Computer Network, Database etc for only English language. Using TFIDF as a feature extraction method we show how both the machine learning models perform on the mentioned languages.

pdf bib abs
Graph Based Automatic Domain Term Extraction
Hema Ala | Dipti Sharma
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TermTraction 2020 Shared Task

We present a Graph Based Approach to automatically extract domain specific terms from technical domains like Biochemistry, Communication, Computer Science and Law. Our approach is similar to TextRank with an extra post-processing step to reduce the noise. We performed our experiments on the mentioned domains provided by ICON TermTraction - 2020 shared task. Presented precision, recall and f1-score for all experiments. Further, it is observed that our method gives promising results without much noise in domain terms.

pdf bib abs
AdapNMT : Neural Machine Translation with Technical Domain Adaptation for Indic Languages
Hema Ala | Dipti Sharma
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

Adapting new domain is highly challenging task for Neural Machine Translation (NMT). In this paper we show the capability of general domain machine translation when translating into Indic languages (English - Hindi , English - Telugu and Hindi - Telugu), and low resource domain adaptation of MT systems using existing general parallel data and small in domain parallel data for AI and Chemistry Domains. We carried out our experiments using Byte Pair Encoding(BPE) as it solves rare word problems. It has been observed that with addition of little amount of in-domain data to the general data improves the BLEU score significantly.

2019

pdf bib
Proceedings of the 16th International Conference on Natural Language Processing
Dipti Misra Sharma | Pushpak Bhattacharya
Proceedings of the 16th International Conference on Natural Language Processing

pdf abs
Dataset for Aspect Detection on Mobile reviews in Hindi
Pruthwik Mishra | Ayush Joshi | Dipti Sharma
Proceedings of the 16th International Conference on Natural Language Processing

In recent years Opinion Mining has become one of the very interesting fields of Language Processing. To extract the gist of a sentence in a shorter and efficient manner is what opinion mining provides. In this paper we focus on detecting aspects for a particular domain. While relevant research work has been done in aspect detection in resource rich languages like English, we are trying to do the same in a relatively resource poor Hindi language. Here we present a corpus of mobile reviews which are labelled with carefully curated aspects. The motivation behind Aspect detection is to get information on a finer level about the data. In this paper we identify all aspects related to the gadget which are present on the reviews given online on various websites. We also propose baseline models to detect aspects in Hindi text after conducting various experiments.

pdf abs
Towards Handling Verb Phrase Ellipsis in English-Hindi Machine Translation
Niyati Bafna | Dipti Sharma
Proceedings of the 16th International Conference on Natural Language Processing

English-Hindi machine translation systems have difficulty interpreting verb phrase ellipsis (VPE) in English, and commit errors in translating sentences with VPE. We present a solution and theoretical backing for the treatment of English VPE, with the specific scope of enabling English-Hindi MT, based on an understanding of the syntactical phenomenon of verb-stranding verb phrase ellipsis in Hindi (VVPE). We implement a rule-based system to perform the following sub-tasks: 1) Verb ellipsis identification in the English source sentence, 2) Elided verb phrase head identification 3) Identification of verb segment which needs to be induced at the site of ellipsis 4) Modify input sentence; i.e. resolving VPE and inducing the required verb segment. This system obtains 94.83 percent precision and 83.04 percent recall on subtask (1), tested on 3900 sentences from the BNC corpus. This is competitive with state-of-the-art results. We measure accuracy of subtasks (2) and (3) together, and obtain a 91 percent accuracy on 200 sentences taken from the WSJ corpus. Finally, in order to indicate the relevance of ellipsis handling to MT, we carried out a manual analysis of the English-Hindi MT outputs of 100 sentences after passing it through our system. We set up a basic metric (1-5) for this evaluation, where 5 indicates drastic improvement, and obtained an average of 3.55. As far as we know, this is the first attempt to target ellipsis resolution in the context of improving English-Hindi machine translation.

pdf abs
Kunji : A Resource Management System for Higher Productivity in Computer Aided Translation Tools
Priyank Gupta | Manish Shrivastava | Dipti Misra Sharma | Rashid Ahmad
Proceedings of the 16th International Conference on Natural Language Processing

Complex NLP applications, such as machine translation systems, utilize various kinds of resources namely lexical, multiword, domain dictionaries, maps and rules etc. Similarly, translators working on Computer Aided Translation workbenches, also require help from various kinds of resources - glossaries, terminologies, concordances and translation memory in the workbenches in order to increase their productivity. Additionally, translators have to look away from the workbenches for linguistic resources like Named Entities, Multiwords, lexical and lexeme dictionaries in order to get help, as the available resources like concordances, terminologies and glossaries are often not enough. In this paper we present Kunji, a resource management system for translation workbenches and MT modules. This system can be easily integrated in translation workbenches and can also be used as a management tool for resources for MT systems. The described resource management system has been integrated in a translation workbench Transzaar. We also study the impact of providing this resource management system along with linguistic resources on the productivity of translators for English-Hindi language pair. When the linguistic resources like lexeme, NER and MWE dictionaries were made available to translators in addition to their regular translation memories, concordances and terminologies, their productivity increased by 15.61%.

pdf abs
LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019
Vikrant Goyal | Dipti Misra Sharma
Proceedings of the 6th Workshop on Asian Translation

This paper describes the Neural Machine Translation systems of IIIT-Hyderabad (LTRC-MT) for WAT 2019 Hindi-English shared task. We experimented with both Recurrent Neural Networks & Transformer architectures. We also show the results of our experiments of training NMT models using additional data via backtranslation.

pdf abs
Towards Automated Semantic Role Labelling of Hindi-English Code-Mixed Tweets
Riya Pal | Dipti Sharma
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

We present a system for automating Semantic Role Labelling of Hindi-English code-mixed tweets. We explore the issues posed by noisy, user generated code-mixed social media data. We also compare the individual effect of various linguistic features used in our system. Our proposed model is a 2-step system for automated labelling which gives an overall accuracy of 84% for Argument Classification, marking a 10% increase over the existing rule-based baseline model. This is the first attempt at building a statistical Semantic Role Labeller for Hindi-English code-mixed data, to the best of our knowledge.

pdf abs
A Dataset for Semantic Role Labelling of Hindi-English Code-Mixed Tweets
Riya Pal | Dipti Sharma
Proceedings of the 13th Linguistic Annotation Workshop

We present a data set of 1460 Hindi-English code-mixed tweets consisting of 20,949 tokens labelled with Proposition Bank labels marking their semantic roles. We created verb frames for complex predicates present in the corpus and formulated mappings from Paninian dependency labels to Proposition Bank labels. With the help of these mappings and the dependency tree, we propose a baseline rule based system for Semantic Role Labelling of Hindi-English code-mixed data. We obtain an accuracy of 96.74% for Argument Identification and are able to further classify 73.93% of the labels correctly. While there is relevant ongoing research on Semantic Role Labelling and on building tools for code-mixed social media data, this is the first attempt at labelling semantic roles in code-mixed data, to the best of our knowledge.

pdf abs
The IIIT-H Gujarati-English Machine Translation System for WMT19
Vikrant Goyal | Dipti Misra Sharma
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the Neural Machine Translation system of IIIT-Hyderabad for the Gujarati→English news translation shared task of WMT19. Our system is basedon encoder-decoder framework with attention mechanism. We experimented with Multilingual Neural MT models. Our experiments show that Multilingual Neural Machine Translation leveraging parallel data from related language pairs helps in significant BLEU improvements upto 11.5, for low resource language pairs like Gujarati-English

pdf
Pāṇinian Syntactico-Semantic Relation Labels
Amba Kulkarni | Dipti Sharma
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)

pdf
Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank
Urmi Ghosh | Dipti Sharma | Simran Khanuja
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

pdf bib
Proceedings of the 15th International Conference on Natural Language Processing
Gurpreet Singh Lehal | Dipti Misra Sharma | Rajeev Sangal
Proceedings of the 15th International Conference on Natural Language Processing

pdf abs
Universal Dependency Parsing for Hindi-English Code-Switching
Irshad Bhat | Riyaz A. Bhat | Manish Shrivastava | Dipti Sharma
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Code-switching is a phenomenon of mixing grammatical structures of two or more languages under varied social constraints. The code-switching data differ so radically from the benchmark corpora used in NLP community that the application of standard technologies to these data degrades their performance sharply. Unlike standard corpora, these data often need to go through additional processes such as language identification, normalization and/or back-transliteration for their efficient processing. In this paper, we investigate these indispensable processes and other problems associated with syntactic parsing of code-switching data and propose methods to mitigate their effects. In particular, we study dependency parsing of code-switching data of Hindi and English multilingual speakers from Twitter. We present a treebank of Hindi-English code-switching tweets under Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages the part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks. We also present normalization and back-transliteration models with a decoding process tailored for code-switching data. Results show that our neural stacking parser is 1.5% LAS points better than the augmented parsing model and 3.8% LAS points better than the one which uses first-best normalization and/or back-transliteration.

pdf
IIT(BHU)–IIITH at CoNLL–SIGMORPHON 2018 Shared Task on Universal Morphological Reinflection
Abhishek Sharma | Ganesh Katrapati | Dipti Misra Sharma
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

pdf
EquGener: A Reasoning Network for Word Problem Solving by Generating Arithmetic Equations
Pruthwik Mishra | Litton J Kurisinkel | Dipti Misra Sharma | Vasudeva Varma
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
No more beating about the bush : A Step towards Idiom Handling for Indian Language NLP
Ruchit Agrawal | Vighnesh Chenthil Kumar | Vigneshwaran Muralidharan | Dipti Sharma
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Deep Neural Network based system for solving Arithmetic Word problems
Purvanshi Mehta | Pruthwik Mishra | Vinayak Athavale | Manish Shrivastava | Dipti Sharma
Proceedings of the IJCNLP 2017, System Demonstrations

This paper presents DILTON a system which solves simple arithmetic word problems. DILTON uses a Deep Neural based model to solve math word problems. DILTON divides the question into two parts - worldstate and query. The worldstate and the query are processed separately in two different networks and finally, the networks are merged to predict the final operation. We report the first deep learning approach for the prediction of operation between two numbers. DILTON learns to predict operations with 88.81% accuracy in a corpus of primary school questions.

pdf abs
Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data
Irshad Bhat | Riyaz A. Bhat | Manish Shrivastava | Dipti Sharma
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this paper, we propose efficient and less resource-intensive strategies for parsing of code-mixed data. These strategies are not constrained by in-domain annotations, rather they leverage pre-existing monolingual annotated resources for training. We show that these methods can produce significantly better results as compared to an informed baseline. Due to lack of an evaluation set for code-mixed structures, we also present a data set of 450 Hindi and English code-mixed tweets of Hindi multilingual speakers for evaluation.

pdf abs
Leveraging Newswire Treebanks for Parsing Conversational Data with Argument Scrambling
Riyaz A. Bhat | Irshad Bhat | Dipti Sharma
Proceedings of the 15th International Conference on Parsing Technologies

We investigate the problem of parsing conversational data of morphologically-rich languages such as Hindi where argument scrambling occurs frequently. We evaluate a state-of-the-art non-linear transition-based parsing system on a new dataset containing 506 dependency trees for sentences from Bollywood (Hindi) movie scripts and Twitter posts of Hindi monolingual speakers. We show that a dependency parser trained on a newswire treebank is strongly biased towards the canonical structures and degrades when applied to conversational data. Inspired by Transformational Generative Grammar (Chomsky, 1965), we mitigate the sampling bias by generating all theoretically possible alternative word orders of a clause from the existing (kernel) structures in the treebank. Training our parser on canonical and transformed structures improves performance on conversational data by around 9% LAS over the baseline newswire parser.

pdf
Unity in Diversity: A Unified Parsing Strategy for Major Indian Languages
Juhi Tandon | Dipti Misra Sharma
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

pdf
Three-phase training to address data sparsity in Neural Machine Translation
Ruchit Agrawal | Mihir Shekhar | Dipti Sharma
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf
A vis-à-vis evaluation of MT paradigms for linguistically distant languages
Ruchit Agrawal | Jahfar Ali | Dipti Misra Sharma
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf
POS Tagging For Resource Poor Languages Through Feature Projection
Pruthwik Mishra | Vandan Mujadia | Dipti Misra Sharma
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf
Linguistic approach based Transfer Learning for Sentiment Classification in Hindi
Vartika Rai | Sakshee Vijay | Dipti Misra
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

pdf
Semisupervied Data Driven Word Sense Disambiguation for Resource-poor Languages
Pratibha Rani | Vikram Pudi | Dipti M. Sharma
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

pdf
Shallow Parsing Pipeline - Hindi-English Code-Mixed Social Media Text
Arnav Sharma | Sakshi Gupta | Raveesh Motlani | Piyush Bansal | Manish Shrivastava | Radhika Mamidi | Dipti M. Sharma
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Explicit Argument Identification for Discourse Parsing In Hindi: A Hybrid Pipeline
Rohit Jain | Dipti Sharma
Proceedings of the NAACL Student Research Workshop

pdf
Non-decreasing Sub-modular Function for Comprehensible Summarization
Litton J Kurisinkel | Pruthwik Mishra | Vigneshwaran Muralidaran | Vasudeva Varma | Dipti Misra Sharma
Proceedings of the NAACL Student Research Workshop

pdf
Kathaa: A Visual Programming Framework for NLP Applications
Sharada Prasanna Mohanty | Nehal J Wani | Manish Srivastava | Dipti Misra Sharma
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf
Conversion from Paninian Karakas to Universal Dependencies for Hindi Dependency Treebank
Juhi Tandon | Himani Chaudhry | Riyaz Ahmad Bhat | Dipti Sharma
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

pdf bib abs
Kathaa : NLP Systems as Edge-Labeled Directed Acyclic MultiGraphs
Sharada Mohanty | Nehal J Wani | Manish Srivastava | Dipti Sharma
Proceedings of the Third International Workshop on Worldwide Language Service Infrastructure and Second Workshop on Open Infrastructures and Analysis Frameworks for Human Language Technologies (WLSI/OIAF4HLT2016)

We present Kathaa, an Open Source web-based Visual Programming Framework for Natural Language Processing (NLP) Systems. Kathaa supports the design, execution and analysis of complex NLP systems by visually connecting NLP components from an easily extensible Module Library. It models NLP systems an edge-labeled Directed Acyclic MultiGraph, and lets the user use publicly co-created modules in their own NLP applications irrespective of their technical proficiency in Natural Language Processing. Kathaa exposes an intuitive web based Interface for the users to interact with and modify complex NLP Systems; and a precise Module definition API to allow easy integration of new state of the art NLP components. Kathaa enables researchers to publish their services in a standardized format to enable the masses to use their services out of the box. The vision of this work is to pave the way for a system like Kathaa, to be the Lego blocks of NLP Research and Applications. As a practical use case we use Kathaa to visually implement the Sampark Hindi-Panjabi Machine Translation Pipeline and the Sampark Hindi-Urdu Machine Translation Pipeline, to demonstrate the fact that Kathaa can handle really complex NLP systems while still being intuitive for the end user.

pdf bib
Proceedings of the 13th International Conference on Natural Language Processing
Dipti Misra Sharma | Rajeev Sangal | Anil Kumar Singh
Proceedings of the 13th International Conference on Natural Language Processing

pdf abs
Coreference Annotation Scheme and Relation Types for Hindi
Vandan Mujadia | Palash Gupta | Dipti Misra Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes a coreference annotation scheme, coreference annotation specific issues and their solutions through our proposed annotation scheme for Hindi. We introduce different co-reference relation types between continuous mentions of the same coreference chain such as “Part-of”, “Function-value pair” etc. We used Jaccard similarity based Krippendorff‘s’ alpha to demonstrate consistency in annotation scheme, annotation and corpora. To ease the coreference annotation process, we built a semi-automatic Coreference Annotation Tool (CAT). We also provide statistics of coreference annotation on Hindi Dependency Treebank (HDTB).

pdf abs
Using lexical and Dependency Features to Disambiguate Discourse Connectives in Hindi
Rohit Jain | Himanshu Sharma | Dipti Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Discourse parsing is a challenging task in NLP and plays a crucial role in discourse analysis. To enable discourse analysis for Hindi, Hindi Discourse Relations Bank was created on a subset of Hindi TreeBank. The benefits of a discourse analyzer in automated discourse analysis, question summarization and question answering domains has motivated us to begin work on a discourse analyzer for Hindi. In this paper, we focus on discourse connective identification for Hindi. We explore various available syntactic features for this task. We also explore the use of dependency tree parses present in the Hindi TreeBank and study the impact of the same on the performance of the system. We report that the novel dependency features introduced have a higher impact on precision, in comparison to the syntactic features previously used for this task. In addition, we report a high accuracy of 96% for this task.

This paper describes our efforts for the development of a Proposition Bank for Urdu, an Indo-Aryan language. Our primary goal is the labeling of syntactic nodes in the existing Urdu dependency Treebank with specific argument labels. In essence, it involves annotation of predicate argument structures of both simple and complex predicates in the Treebank corpus. We describe the overall process of building the PropBank of Urdu. We discuss various statistics pertaining to the Urdu PropBank and the issues which the annotators encountered while developing the PropBank. We also discuss how these challenges were addressed to successfully expand the PropBank corpus. While reporting the Inter-annotator agreement between the two annotators, we show that the annotators share similar understanding of the annotation guidelines and of the linguistic phenomena present in the language. The present size of this Propbank is around 180,000 tokens which is double-propbanked by the two annotators for simple predicates. Another 100,000 tokens have been annotated for complex predicates of Urdu.

pdf abs
A Finite-State Morphological Analyser for Sindhi
Raveesh Motlani | Francis Tyers | Dipti Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Morphological analysis is a fundamental task in natural-language processing, which is used in other NLP applications such as part-of-speech tagging, syntactic parsing, information retrieval, machine translation, etc. In this paper, we present our work on the development of free/open-source finite-state morphological analyser for Sindhi. We have used Apertium’s lttoolbox as our finite-state toolkit to implement the transducer. The system is developed using a paradigm-based approach, wherein a paradigm defines all the word forms and their morphological features for a given stem (lemma). We have evaluated our system on the Sindhi Wikipedia corpus and achieved a reasonable coverage of 81% and a precision of over 97%.

pdf abs
Towards Building Semantic Role Labeler for Indian Languages
Maaz Anwar | Dipti Sharma
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a statistical system for identifying the semantic relationships or semantic roles for two major Indian Languages, Hindi and Urdu. Given an input sentence and a predicate/verb, the system first identifies the arguments pertaining to that verb and then classifies it into one of the semantic labels which can either be a DOER, THEME, LOCATIVE, CAUSE, PURPOSE etc. The system is based on 2 statistical classifiers trained on roughly 130,000 words for Urdu and 100,000 words for Hindi that were hand-annotated with semantic roles under the PropBank project for these two languages. Our system achieves an accuracy of 86% in identifying the arguments of a verb for Hindi and 75% for Urdu. At the subsequent task of classifying the constituents into their semantic roles, the Hindi system achieved 58% precision and 42% recall whereas Urdu system performed better and achieved 83% precision and 80% recall. Our study also allowed us to compare the usefulness of different linguistic features and feature combinations in the semantic role labeling task. We also examine the use of statistical syntactic parsing as feature in the role labeling task.

pdf abs
A House United: Bridging the Script and Lexical Barrier between Hindi and Urdu
Riyaz A. Bhat | Irshad A. Bhat | Naman Jain | Dipti Misra Sharma
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In Computational Linguistics, Hindi and Urdu are not viewed as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are separately trained for both Hindi and Urdu despite the fact that they represent the same language. The reasons mainly are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this article, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing the differences between the Hindi and Urdu texts would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions. To remove the script barrier, we learn accurate statistical transliteration models which use sentence-level decoding to resolve word ambiguity. Similarly, we learn cross-register word embeddings from the harmonized Hindi and Urdu corpora to nullify their lexical divergences. As a proof of the concept, we evaluate our approach on the Hindi and Urdu dependency parsing under two scenarios: (a) resource sharing, and (b) resource augmentation. We demonstrate that a neural network-based dependency parser trained on augmented, harmonized Hindi and Urdu resources performs significantly better than the parsing models trained separately on the individual resources. We also show that we can achieve near state-of-the-art results when the parsers are used interchangeably.

pdf
Significance of an Accurate Sandhi-Splitter in Shallow Parsing of Dravidian Languages
Devadath V V | Dipti Misra Sharma
Proceedings of the ACL 2016 Student Research Workshop

2015

pdf
Exploring the effect of semantic similarity for Phrase-based Machine Translation
Kunal Sachdeva | Dipti Sharma
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality

pdf bib
Proceedings of the 12th International Conference on Natural Language Processing
Dipti Misra Sharma | Rajeev Sangal | Elizabeth Sherly
Proceedings of the 12th International Conference on Natural Language Processing

pdf
Applying Sanskrit Concepts for Reordering in MT
Akshar Bharati | Sukhada | Prajna Jha | Soma Paul | Dipti M Sharma
Proceedings of the 12th International Conference on Natural Language Processing

2014

pdf abs
Benchmarking of English-Hindi parallel corpora
Jayendra Rakesh Yeka | Prasanth Kolachina | Dipti Misra Sharma
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present several parallel corpora for EnglishâHindi and talk about their natures and domains. We also discuss briefly a few previous attempts in MT for translation from English to Hindi. The lack of uniformly annotated data makes it difficult to compare these attempts and precisely analyze their strengths and shortcomings. With this in mind, we propose a standard pipeline to provide uniform linguistic annotations to these resources using state-of-art NLP technologies. We conclude the paper by presenting evaluation scores of different statistical MT systems on the corpora detailed in this paper for EnglishâHindi and present the proposed plans for future work. We hope that both these annotated parallel corpora resources and MT systems will serve as benchmarks for future approaches to MT in EnglishâHindi. This was and remains the main motivation for the attempts detailed in this paper.

pdf abs
Towards building a Kashmiri Treebank: Setting up the Annotation Pipeline
Riyaz Ahmad Bhat | Shahid Mushtaq Bhat | Dipti Misra Sharma
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Kashmiri is a resource poor language with very less computational and language resources available for its text processing. As the main contribution of this paper, we present an initial version of the Kashmiri Dependency Treebank. The treebank consists of 1,000 sentences (17,462 tokens), annotated with part-of-speech (POS), chunk and dependency information. The treebank has been manually annotated using the Paninian Computational Grammar (PCG) formalism (Begum et al., 2008; Bharati et al., 2009). This version of Kashmiri treebank is an extension of its earlier verion of 500 sentences (Bhat, 2012), a pilot experiment aimed at defining the annotation guidelines on a small subset of Kashmiri corpora. In this paper, we have refined the guidelines with some significant changes and have carried out inter-annotator agreement studies to ascertain its quality. We also present a dependency parsing pipeline, consisting of a tokenizer, a stemmer, a POS tagger, a chunker and an inter-chunk dependency parser. It, therefore, constitutes the first freely available, open source dependency parser of Kashmiri, setting the initial baseline for Kashmiri dependency parsing.

pdf abs
Hindi to English Machine Translation: Using Effective Selection in Multi-Model SMT
Kunal Sachdeva | Rishabh Srivastava | Sambhav Jain | Dipti Sharma
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Recent studies in machine translation support the fact that multi-model systems perform better than the individual models. In this paper, we describe a Hindi to English statistical machine translation system and improve over the baseline using multiple translation models. We have considered phrase based as well as hierarchical models and enhanced over both these baselines using a regression model. The system is trained over textual as well as syntactic features extracted from source and target of the aforementioned translations. Our system shows significant improvement over the baseline systems for both automatic as well as human evaluations. The proposed methodology is quite generic and easily be extended to other language pairs as well.

pdf
Reducing the Impact of Data Sparsity in Statistical Machine Translation
Karan Singla | Kunal Sachdeva | Srinivas Bangalore | Dipti Misra Sharma | Diksha Yadav
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf
Exploring System Combination approaches for Indo-Aryan MT Systems
Karan Singla | Anupam Singh | Nishkarsh Shastri | Megha Jhunjhunwala | Srinivas Bangalore | Dipti Misra Sharma
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

pdf bib
Proceedings of the 11th International Conference on Natural Language Processing
Dipti Misra Sharma | Rajeev Sangal | Jyoti D. Pawar
Proceedings of the 11th International Conference on Natural Language Processing

pdf
Identification of Karaka relations in an English sentence
Sai Kiran Gorthi | Ashish Palakurthi | Radhika Mamidi | Dipti Misra Sharma
Proceedings of the 11th International Conference on Natural Language Processing

pdf
A Sandhi Splitter for Malayalam
Devadath V V | Litton J Kurisinkel | Dipti Misra Sharma | Vasudeva Varma
Proceedings of the 11th International Conference on Natural Language Processing

pdf
Hindi Word Sketches
Anil Krishna Eragani | Varun Kuchib Hotla | Dipti Misra Sharma | Siva Reddy | Adam Kilgarriff
Proceedings of the 11th International Conference on Natural Language Processing

pdf
SSF: A Common Representation Scheme for Language Analysis for Language Technology Infrastructure Development
Akshar Bharati | Rajeev Sangal | Dipti Misra Sharma | Anil Kumar Singh
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

pdf
Exploring the effects of Sentence Simplification on Hindi to English Machine Translation System
Kshitij Mishra | Ankush Soni | Rahul Sharma | Dipti Sharma
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

2013

pdf
Animacy Annotation in the Hindi Treebank
Itisree Jena | Riyaz Ahmad Bhat | Sambhav Jain | Dipti Misra Sharma
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

pdf
Divergences in English-Hindi Parallel Dependency Treebanks
Himani Chaudhry | Himanshu Sharma | Dipti Misra Sharma
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

pdf
Towards Building Parallel Dependency Treebanks: Intra-Chunk Expansion and Alignment for English Dependency Treebank
Debanka Nandi | Maaz Nomani | Himanshu Sharma | Himani Chaudhary | Sambhav Jain | Dipti Misra Sharma
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

pdf
Animacy Acquisition Using Morphological Case
Riyaz Ahmad Bhat | Dipti Misra Sharma
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Exploring Semantic Information in Hindi WordNet for Hindi Dependency Parsing
Sambhav Jain | Naman Jain | Aniruddha Tammewar | Riyaz Ahmad Bhat | Dipti Sharma
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
A Hybrid Approach for Anaphora Resolution in Hindi
Praveen Dakwale | Vandan Mujadia | Dipti M Sharma
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Exploring Verb Frames for Sentence Simplification in Hindi
Ankush Soni | Sambhav Jain | Dipti Misra Sharma
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf
Anaphora Annotation in Hindi Dependency TreeBank
Praveen Dakwale | Himanshu Sharma | Dipti M Sharma
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf bib
Hindi Derivational Morphological Analyzer
Nikhil Kanuparthi | Abhilash Inumella | Dipti Misra Sharma
Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology

pdf
Intra-Chunk Dependency Annotation : Expanding Hindi Inter-Chunk Annotated Treebank
Prudhvi Kosaraju | Bharat Ram Ambati | Samar Husain | Dipti Misra Sharma | Rajeev Sangal
Proceedings of the Sixth Linguistic Annotation Workshop

pdf
Dependency Treebank of Urdu and its Evaluation
Riyaz Ahmad Bhat | Dipti Misra Sharma
Proceedings of the Sixth Linguistic Annotation Workshop

pdf bib
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages
Dipti Misra Sharma | Prashanth Mannem | Joseph vanGenabith | Sobha Lalitha Devi | Radhika Mamidi | Ranjani Parthasarathi
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

pdf abs
Evaluation of Discourse Relation Annotation in the Hindi Discourse Relation Bank
Sudheer Kolachina | Rashmi Prasad | Dipti Misra Sharma | Aravind Joshi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe our experiments on evaluating recently proposed modifications to the discourse relation annotation scheme of the Penn Discourse Treebank (PDTB), in the context of annotating discourse relations in Hindi Discourse Relation Bank (HDRB). While the proposed modifications were driven by the desire to introduce greater conceptual clarity in the PDTB scheme and to facilitate better annotation quality, our findings indicate that overall, some of the changes render the annotation task much more difficult for the annotators, as also reflected in lower inter-annotator agreement for the relevant sub-tasks. Our study emphasizes the importance of best practices in annotation task design and guidelines, given that a major goal of an annotation effort should be to achieve maximally high agreement between annotators. Based on our study, we suggest modifications to the current version of the HDRB, to be incorporated in our future annotation work.

2011

pdf
Creating an Annotated Tamil Corpus as a Discourse Resource
Ravi Teja Rachakonda | Dipti Misra Sharma
Proceedings of the 5th Linguistic Annotation Workshop

pdf
Error Detection for Treebank Validation
Bharat Ram Ambati | Rahul Agarwal | Mridul Gupta | Samar Husain | Dipti Misra Sharma
Proceedings of the 9th Workshop on Asian Language Resources

2010

pdf
Improving Data Driven Dependency Parsing using Clausal Information
Phani Gadde | Karan Jindal | Samar Husain | Dipti Misra Sharma | Rajeev Sangal
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf abs
Coupling Statistical Machine Translation with Rule-based Transfer and Generation
Arafat Ahsan | Prasanth Kolachina | Sudheer Kolachina | Dipti Misra | Rajeev Sangal
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we present the insights gained from a detailed study of coupling a highly modular English-Hindi RBMT system with a standard phrase-based SMT system. Coupling the RBMT and SMT systems at various stages in the RBMT pipeline, we observe the effects of the source transformations at each stage on the performance of the coupled MT system. We propose an architecture that systematically exploits the structural transfer and robust generation capabilities of the RBMT system. Working with the English-Hindi language pair, we show that the coupling configurations explored in our experiments help address different aspects of the typological divergence between these languages. In spite of working with very small datasets, we report significant improvements both in terms of BLEU (7.14 and 0.87 over the RBMT and the SMT baselines respectively) and subjective evaluation (relative decrease of 17% in SSER).

pdf
Two Methods to Incorporate ’Local Morphosyntactic’ Features in Hindi Dependency Parsing
Bharat Ram Ambati | Samar Husain | Sambhav Jain | Dipti Misra Sharma | Rajeev Sangal
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

pdf
On the Role of NLP in Linguistics
Dipti Misra Sharma
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

pdf
A Preliminary Work on Hindi Causatives
Rafiya Begum | Dipti Misra Sharma
Proceedings of the Eighth Workshop on Asian Language Resouces

We are in the process of creating a multi-representational and multi-layered treebank for Hindi/Urdu (Palmer et al., 2009), which has three main layers: dependency structure, predicate-argument structure (PropBank), and phrase structure. This paper discusses an important issue in treebank design which is often neglected: the use of empty categories (ECs). All three levels of representation make use of ECs. We make a high-level distinction between two types of ECs, trace and silent, on the basis of whether they are postulated to mark displacement or not. Each type is further refined into several subtypes based on the underlying linguistic phenomena which the ECs are introduced to handle. This paper discusses the stages at which we add ECs to the Hindi/Urdu treebank and why. We investigate methodically the different types of ECs and their role in our syntactic and semantic representations. We also examine our decisions whether or not to coindex each type of ECs with other elements in the representation.

pdf abs
A High Recall Error Identification Tool for Hindi Treebank Validation
Bharat Ram Ambati | Mridul Gupta | Samar Husain | Dipti Misra Sharma
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the development of a hybrid tool for a semi-automated process for validation of treebank annotation at various levels. The tool is developed for error detection at the part-of-speech, chunk and dependency levels of a Hindi treebank, currently under development. The tool aims to identify as many errors as possible at these levels to achieve consistency in the task of annotation. Consistency in treebank annotation is a must for making data as error-free as possible and for providing quality assurance. The tool is aimed at ensuring consistency and to make manual validation cost effective. We discuss a rule based and a hybrid approach (statistical methods combined with rule-based methods) by which a high-recall system can be developed and used to identify errors in the treebank. We report some results of using the tool on a sample of data extracted from the Hindi treebank. We also argue how the tool can prove useful in improving the annotation guidelines which would in turn, better the quality of annotation in subsequent iterations.

pdf abs
Partial Parsing as a Method to Expedite Dependency Annotation of a Hindi Treebank
Mridul Gupta | Vineet Yadav | Samar Husain | Dipti Misra Sharma
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper describes an approach to expedite the process of manual annotation of a Hindi dependency treebank which is currently under development. We propose a way by which consistency among a set of manual annotators could be improved. Furthermore, we show that our setup can also prove useful for evaluating when an inexperienced annotator is ready to start participating in the production of the treebank. We test our approach on sample sets of data obtained from an ongoing work on creation of this treebank. The results asserting our proposal are reported in this paper. We report results from a semi-automated approach of dependency annotation experiment. We find out the rate of agreement between annotators using Cohens Kappa. We also compare results with respect to the total time taken to annotate sample data-sets using a completely manual approach as opposed to a semi-automated approach. It is observed from the results that this semi-automated approach when carried out with experienced and trained human annotators improves the overall quality of treebank annotation and also speeds up the process.

2009

pdf
Constraint Based Hybrid Approach to Parsing Indian Languages
Akshar Bharati | Samar Husain | Meher Vijay | Kalyan Deepak | Dipti Misra Sharma | Rajeev Sangal
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf
The Hindi Discourse Relation Bank
Umangi Oza | Rashmi Prasad | Sudheer Kolachina | Dipti Misra Sharma | Aravind Joshi
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf
Simple Parser for Indian Languages in a Dependency Framework
Akshar Bharati | Mridul Gupta | Vineet Yadav | Karthik Gali | Dipti Misra Sharma
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

pdf
Two stage constraint based hybrid approach to free word order language dependency parsing
Akshar Bharati | Samar Husain | Dipti Misra | Rajeev Sangal
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf
Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition
Karthik Gali | Harshit Surana | Ashwini Vaidya | Praneeth Shishtla | Dipti Misra Sharma
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages

pdf
Towards an Annotated Corpus of Discourse Relations in Hindi
Rashmi Prasad | Samar Husain | Dipti Sharma | Aravind Joshi
Proceedings of the 6th Workshop on Asian Language Resources

pdf abs
Developing Verb Frames for Hindi
Rafiya Begum | Samar Husain | Lakshmi Bai | Dipti Misra Sharma
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper introduces an ongoing work on developing verb frames for Hindi. Verb frames capture syntactic commonalities of semantically related verbs. The main objective of this work is to create a linguistic resource which will prove to be indispensable for various NLP applications. We also hope this resource to help us better understand Hindi verbs. We motivate the basic verb argument structure using relations as introduced by Panini. We show the methodology used in preparing these frames and the criteria followed for classifying Hindi verbs.