2024
Context and WSD: Analysing Google Translate’s Sanskrit to English Output of Bhagavadgītā Verses for Word Meaning
Anagha Pradeep | Radhika Mamidi | Pavankumar Satuluri
Proceedings of the 7th International Sanskrit Computational Linguistics Symposium
2023
Neural Approaches for Data Driven Dependency Parsing in Sanskrit
Amrith Krishna | Ashim Gupta | Deepak Garasangi | Jeevnesh Sandhan | Pavankumar Satuluri | Pawan Goyal
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference
DepNeCTI: Dependency-based Nested Compound Type Identification for Sanskrit
Jivnesh Sandhan | Yaswanth Narsupalli | Sreevatsa Muppirala | Sriram Krishnan | Pavankumar Satuluri | Amba Kulkarni | Pawan Goyal
Findings of the Association for Computational Linguistics: EMNLP 2023
Multi-component compounding is a prevalent phenomenon in Sanskrit, and understanding the implicit structure of a compound’s components is crucial for deciphering its meaning. Earlier approaches in Sanskrit have focused on binary compounds and neglected the multi-component compound setting. This work introduces the novel task of nested compound type identification (NeCTI), which aims to identify nested spans of a multi-component compound and decode the implicit semantic relations between them. To the best of our knowledge, this is the first attempt in the field of lexical semantics to propose this task. We present 2 newly annotated datasets including an out-of-domain dataset for this task. We also benchmark these datasets by exploring the efficacy of the standard problem formulations such as nested named entity recognition, constituency parsing and seq2seq, etc. We present a novel framework named DepNeCTI: Dependency-based Nested Compound Type Identifier that surpasses the performance of the best baseline with an average absolute improvement of 13.1 points F1-score in terms of Labeled Span Score (LSS) and a 5-fold enhancement in inference efficiency. In line with the previous findings in the binary Sanskrit compound identification task, context provides benefits for the NeCTI task. The codebase and datasets are publicly available at: https://github.com/yaswanth-iitkgp/DepNeCTI
2020
A Graph-Based Framework for Structured Prediction Tasks in Sanskrit
Amrith Krishna | Bishal Santra | Ashim Gupta | Pavankumar Satuluri | Pawan Goyal
Computational Linguistics, Volume 46, Issue 4 - December 2020
We propose a framework using energy-based models for multiple structured prediction tasks in Sanskrit. Ours is an arc-factored model, similar to the graph-based parsing approaches, and we consider the tasks of word segmentation, morphological parsing, dependency parsing, syntactic linearization, and prosodification, a “prosody-level” task we introduce in this work. Ours is a search-based structured prediction framework, which expects a graph as input, where relevant linguistic information is encoded in the nodes, and the edges are then used to indicate the association between these nodes. Typically, the state-of-the-art models for morphosyntactic tasks in morphologically rich languages still rely on hand-crafted features for their performance. But here, we automate the learning of the feature function. The feature function so learned, along with the search space we construct, encode relevant linguistic information for the tasks we consider. This enables us to substantially reduce the training data requirements to as low as 10%, as compared to the data requirements for the neural state-of-the-art models. Our experiments in Czech and Sanskrit show the language-agnostic nature of the framework, where we train highly competitive models for both the languages. Moreover, our framework enables us to incorporate language-specific constraints to prune the search space and to filter the candidates during inference. We obtain significant improvements in morphosyntactic tasks for Sanskrit by incorporating language-specific constraints into the model. In all the tasks we discuss for Sanskrit, we either achieve state-of-the-art results or ours is the only data-driven solution for those tasks.
Keep it Surprisingly Simple: A Simple First Order Graph Based Parsing Model for Joint Morphosyntactic Parsing in Sanskrit
Amrith Krishna | Ashim Gupta | Deepak Garasangi | Pavankumar Satuluri | Pawan Goyal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Morphologically rich languages seem to benefit from joint processing of morphology and syntax, as compared to pipeline architectures. We propose a graph-based model for joint morphological parsing and dependency parsing in Sanskrit. Here, we extend the energy-based model framework (Krishna et al., 2020), proposed for several structured prediction tasks in Sanskrit, in 2 simple yet significant ways. First, the framework’s default input graph generation method is modified to generate a multigraph, which enables the use of an exact search inference. Second, we prune the input search space using a linguistically motivated approach, rooted in the traditional grammatical analysis of Sanskrit. Our experiments show that the morphological parsing from our joint model outperforms standalone morphological parsers. We report state-of-the-art results in morphological parsing, and in dependency parsing, both in the standalone (with gold morphological tags) and the joint morphosyntactic parsing settings.
Dependency Relations for Sanskrit Parsing and Treebank
Amba Kulkarni | Pavankumar Satuluri | Sanjeev Panchal | Malay Maity | Amruta Malvade
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories
2019
Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A Case for Low-Resource Languages
Amrith Krishna | Vishnu Sharma | Bishal Santra | Aishik Chakraborty | Pavankumar Satuluri | Pawan Goyal
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
The word ordering in a Sanskrit verse is often not aligned with its corresponding prose order. Conversion of the verse to its corresponding prose helps in better comprehension of the construction. Owing to the resource constraints, we formulate this task as a word ordering (linearisation) task. In doing so, we completely ignore the word arrangement at the verse side. kāvya guru, the approach we propose, essentially consists of a pipeline of two pretraining steps followed by a seq2seq model. The first pretraining step learns task-specific token embeddings from pretrained embeddings. In the next step, we generate multiple hypotheses for possible word arrangements of the input. We then use them as inputs to a neural seq2seq model for the final prediction. We empirically show that the hypotheses generated by our pretraining step result in predictions that consistently outperform predictions based on the original order in the verse. Overall, kāvya guru outperforms current state-of-the-art models in linearisation for the poetry to prose conversion task in Sanskrit.
2018
Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit
Amrith Krishna | Bishal Santra | Sasi Prasanth Bandaru | Gaurav Sahu | Vishnu Dutt Sharma | Pavankumar Satuluri | Pawan Goyal
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
The configurational information in sentences of a free word order language such as Sanskrit is of limited use. Thus, the context of the entire sentence will be desirable even for basic processing tasks such as word segmentation. We propose a structured prediction framework that jointly solves the word segmentation and morphological tagging tasks in Sanskrit. We build an energy based model where we adopt approaches generally employed in graph based parsing techniques (McDonald et al., 2005a; Carreras, 2007). Our model outperforms the state of the art with an F-Score of 96.92 (percentage improvement of 7.06%) while using less than one tenth of the task-specific training data. We find that the use of a graph based approach instead of a traditional lattice-based sequential labelling approach leads to a percentage gain of 12.6% in F-Score for the segmentation task.
2017
A Dataset for Sanskrit Word Segmentation
Amrith Krishna | Pavan Kumar Satuluri | Pawan Goyal
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.
A Graph Based Semi-Supervised Approach for Analysis of Derivational Nouns in Sanskrit
Amrith Krishna | Pavankumar Satuluri | Harshavardhan Ponnada | Muneeb Ahmed | Gulab Arora | Kaustubh Hiware | Pawan Goyal
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing
Derivational nouns are widely used in Sanskrit corpora and represent an important cornerstone of productivity in the language. Currently there exists no analyser that identifies the derivational nouns. We propose a semi-supervised approach for identification of derivational nouns in Sanskrit. We not only identify the derivational words, but also link them to their corresponding source words. Our novelty comes in the design of the network structure for the task. The edge weights are featurised based on the phonetic, morphological, syntactic and semantic similarity shared between the words to be identified. We find that our model is effective for the task, even when we employ a labelled dataset that is only 5% of the entire dataset.
2016
Word Segmentation in Sanskrit Using Path Constrained Random Walks
Amrith Krishna | Bishal Santra | Pavankumar Satuluri | Sasi Prasanth Bandaru | Bhumi Faldu | Yajuvendra Singh | Pawan Goyal
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
In Sanskrit, the phonemes at the word boundaries undergo changes to form new phonemes through a process called sandhi. A fused sentence can have multiple possible segmentations. We propose a word segmentation approach that predicts the most semantically valid segmentation for a given sentence. We treat the problem as a query expansion problem and use the path-constrained random walks framework to predict the correct segments.
Compound Type Identification in Sanskrit: What Roles do the Corpus and Grammar Play?
Amrith Krishna | Pavankumar Satuluri | Shubham Sharma | Apurv Kumar | Pawan Goyal
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
We propose a classification framework for semantic type identification of compounds in Sanskrit. We broadly classify the compounds into four different classes, namely Avyayībhāva, Tatpuruṣa, Bahuvrīhi and Dvandva. Our classification is based on the traditional classification system followed by the ancient grammar treatise Aṣṭādhyāyī, proposed by Pāṇini 25 centuries ago. We construct an elaborate feature space for our system by combining conditional rules from the grammar Aṣṭādhyāyī, semantic relations between the compound components from a lexical database Amarakoṣa and linguistic structures from the data using Adaptor Grammars. Our in-depth analysis of the feature space highlights the inadequacy of Aṣṭādhyāyī, a generative grammar, in classifying the data samples. Our experimental results validate the effectiveness of using lexical databases as suggested by Amba Kulkarni and Anil Kumar, and put forward a new research direction by introducing linguistic patterns obtained from Adaptor Grammars for effective identification of compound type. We utilise an ensemble-based approach, specifically designed for handling skewed datasets, and we achieve an overall accuracy of 0.77 using random forest classifiers.