What do phone embeddings learn about Phonology?
Sudheer Kolachina
Lilla Magyar
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology
Recent work has looked at the evaluation of phone embeddings using sound analogies and correlations between distinctive feature space and embedding space. It has not been clear what aspects of natural language phonology are learnt by neural-network-inspired distributed representational models such as word2vec. To study the kinds of phonological relationships learnt by phone embeddings, we present artificial phonology experiments that show that phone embeddings learn paradigmatic relationships such as phonemic and allophonic distribution quite well. They are also able to capture co-occurrence restrictions among vowels such as those observed in languages with vowel harmony. However, they are unable to learn co-occurrence restrictions among the class of consonants.
Evaluation of Discourse Relation Annotation in the Hindi Discourse Relation Bank
Sudheer Kolachina
Rashmi Prasad
Dipti Misra Sharma
Aravind Joshi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We describe our experiments on evaluating recently proposed modifications to the discourse relation annotation scheme of the Penn Discourse Treebank (PDTB), in the context of annotating discourse relations in the Hindi Discourse Relation Bank (HDRB). While the proposed modifications were driven by the desire to introduce greater conceptual clarity in the PDTB scheme and to facilitate better annotation quality, our findings indicate that overall, some of the changes render the annotation task much more difficult for the annotators, as is also reflected in lower inter-annotator agreement for the relevant sub-tasks. Our study emphasizes the importance of best practices in annotation task design and guidelines, given that a major goal of an annotation effort should be to achieve maximally high agreement between annotators. Based on our study, we suggest modifications to the current version of the HDRB, to be incorporated in our future annotation work.
Parsing Any Domain English text to CoNLL dependencies
Sudheer Kolachina
Prasanth Kolachina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
It is well known that accuracies of statistical parsers trained on the Penn Treebank and tested on sets drawn from the same corpus tend to be overestimates of their actual parsing performance. This gives rise to the need for evaluation of parsing performance on corpora from different domains. Evaluating multiple parsers on test sets from different domains can give a detailed picture of the relative strengths and weaknesses of different parsing approaches. Such information is also necessary to guide the choice of parser in applications such as machine translation, where text from multiple domains needs to be handled. In this paper, we report a benchmarking study of different state-of-the-art parsers for English, both constituency and dependency. The constituency parser output is converted into CoNLL-style dependency trees so that parsing performance can be compared across formalisms. Specifically, we train rerankers for the Berkeley and Stanford parsers to study the usefulness of reranking for handling texts from different domains. The results of our experiments lead to interesting insights about the out-of-domain performance of different English parsers.
Grammar Extraction from Treebanks for Hindi and Telugu
Prasanth Kolachina
Sudheer Kolachina
Anil Kumar Singh
Samar Husain
Viswanath Naidu
Rajeev Sangal
Akshar Bharati
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Grammars play an important role in many Natural Language Processing (NLP) applications. The traditional approach of creating grammars manually, besides being labor-intensive, has several limitations. With the availability of large-scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank. In this paper, we present a basic approach to extract grammars from dependency treebanks of two Indian languages, Hindi and Telugu. The process of grammar extraction requires a generalization mechanism. Towards this end, we explore an approach which relies on generalization of argument structure over the verbs based on their syntactic similarity. Such a generalization counters the effect of data sparseness in the treebanks. A grammar extracted using this system can not only expand already existing knowledge bases for NLP tasks such as parsing, but also aid in the creation of grammars for languages where none exist. Further, we show that the grammar extraction process can help in identifying annotation errors and thus aid in the task of treebank validation.
Phrase Based Decoding using a Discriminative Model
Prasanth Kolachina
Sriram Venkatapathy
Srinivas Bangalore
Sudheer Kolachina
Avinesh PVS
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation
Coupling Statistical Machine Translation with Rule-based Transfer and Generation
Arafat Ahsan
Prasanth Kolachina
Sudheer Kolachina
Dipti Misra
Rajeev Sangal
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers
In this paper, we present the insights gained from a detailed study of coupling a highly modular English-Hindi RBMT system with a standard phrase-based SMT system. Coupling the RBMT and SMT systems at various stages in the RBMT pipeline, we observe the effects of the source transformations at each stage on the performance of the coupled MT system. We propose an architecture that systematically exploits the structural transfer and robust generation capabilities of the RBMT system. Working with the English-Hindi language pair, we show that the coupling configurations explored in our experiments help address different aspects of the typological divergence between these languages. In spite of working with very small datasets, we report significant improvements both in terms of BLEU (7.14 and 0.87 over the RBMT and the SMT baselines respectively) and subjective evaluation (relative decrease of 17% in SSER).
The Hindi Discourse Relation Bank
Umangi Oza
Rashmi Prasad
Sudheer Kolachina
Dipti Misra Sharma
Aravind Joshi
Proceedings of the Third Linguistic Annotation Workshop (LAW III)
Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training
Taraka Rama
Anil Kumar Singh
Sudheer Kolachina
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium