2024
pdf
abs
Revisiting VMWEs in Hindi: Annotating Layers of Predication
Kanishka Jain
|
Ashwini Vaidya
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Multiword expressions in languages like Hindi are both productive and challenging. Hindi not only uses a variety of verbal multiword expressions (VMWEs) but also employs different combinatorial strategies to create new types of multiword expressions. In this paper we investigate two such strategies that are quite common in the language. First, we describe how VMWEs in Hindi are not just lexical but also morphological: causatives are formed morphologically in Hindi. Second, we examine stacked VMWEs, i.e. cases where at least two VMWEs occur together. We suggest that the existing PARSEME annotation framework can be extended to these two phenomena without changing the existing guidelines. We also propose rule-based heuristics using existing Universal Dependency annotations to automatically identify and annotate some of the VMWEs in the language. The goal of this paper is to refine the existing PARSEME corpus of Hindi for VMWEs while expanding its scope, giving a more comprehensive picture of VMWEs in Hindi.
2021
pdf
abs
Fine-tuning Distributional Semantic Models for Closely-Related Languages
Kushagra Bhatia
|
Divyanshu Aggarwal
|
Ashwini Vaidya
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
In this paper we compare the performance of three models: SGNS (skip-gram negative sampling) and augmented versions of SVD (singular value decomposition) and PPMI (Positive Pointwise Mutual Information) on a word similarity task. We particularly focus on the role of hyperparameter tuning for Hindi based on recommendations made in previous work (on English). Our results show that there are language specific preferences for these hyperparameters. We extend the best settings for Hindi to a set of related languages: Punjabi, Gujarati and Marathi with favourable results. We also find that a suitably tuned SVD model outperforms SGNS for most of our languages and is also more robust in a low-resource setting.
pdf
bib
Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)
Paul Cook
|
Jelena Mitrović
|
Carla Parra Escartín
|
Ashwini Vaidya
|
Petya Osenova
|
Shiva Taslimipoor
|
Carlos Ramisch
Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)
2020
pdf
bib
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
Stella Markantonatou
|
John McCrae
|
Jelena Mitrović
|
Carole Tiberius
|
Carlos Ramisch
|
Ashwini Vaidya
|
Petya Osenova
|
Agata Savary
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
pdf
abs
Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
Carlos Ramisch
|
Agata Savary
|
Bruno Guillaume
|
Jakub Waszczuk
|
Marie Candito
|
Ashwini Vaidya
|
Verginica Barbu Mititelu
|
Archna Bhatia
|
Uxoa Iñurrieta
|
Voula Giouli
|
Tunga Güngör
|
Menghan Jiang
|
Timm Lichte
|
Chaya Liebeskind
|
Johanna Monti
|
Renata Ramisch
|
Sara Stymne
|
Abigail Walsh
|
Hongzhi Xu
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.
2019
pdf
abs
Towards measuring lexical complexity in Malayalam
Richard Shallam
|
Ashwini Vaidya
Proceedings of the 16th International Conference on Natural Language Processing
This paper proposes a metric to quantify lexical complexity in Malayalam. The metric utilizes word frequency, orthography and morphology as the three factors affecting visual word recognition in Malayalam. Malayalam differs from other Indian languages due to its agglutinative morphology and orthography, which are incorporated into our model. The predictions made by our model are then evaluated against reaction times in a lexical decision task. We find that reaction times are predicted by frequency, morphological complexity and script complexity. We also explore the interactions between morphological complexity with frequency and script in our results. To the best of our knowledge, this is the first study on lexical complexity in Malayalam.
pdf
bib
abs
Syntactic composition and selectional preferences in Hindi Light Verb Constructions
Ashwini Vaidya
|
Martha Palmer
Linguistic Issues in Language Technology, Volume 17, 2019
Previous work on light verb constructions (e.g. chorii kar ‘theft do; steal’) in Hindi describes their syntactic formation via co-predication (Ahmed et al., 2012; Butt, 2014). This implies that both noun and light verb contribute their arguments, and these overlapping argument structures must be composed in the syntax. In this paper, we present a co-predication analysis using Tree-Adjoining Grammar, which models syntactic composition and semantic selectional preferences without transformations (deletion or argument identification). The analysis has two key components: (i) an underspecified category for the nominal and (ii) combinatorial constraints on the noun and light verb to specify selectional preferences. The former has the advantage of syntactic composition without argument identification and the latter prevents over-generalization, while recognizing the semantic contribution of both predicates. This work additionally accounts for the agreement facts for the Hindi LVC.
2018
pdf
abs
Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Carlos Ramisch
|
Silvio Ricardo Cordeiro
|
Agata Savary
|
Veronika Vincze
|
Verginica Barbu Mititelu
|
Archna Bhatia
|
Maja Buljan
|
Marie Candito
|
Polona Gantar
|
Voula Giouli
|
Tunga Güngör
|
Abdelati Hawwari
|
Uxoa Iñurrieta
|
Jolanta Kovalevskaitė
|
Simon Krek
|
Timm Lichte
|
Chaya Liebeskind
|
Johanna Monti
|
Carla Parra Escartín
|
Behrang QasemiZadeh
|
Renata Ramisch
|
Nathan Schneider
|
Ivelina Stoyanova
|
Ashwini Vaidya
|
Abigail Walsh
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)
This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.
2017
pdf
Understanding Constraints on Non-Projectivity Using Novel Measures
Himanshu Yadav
|
Ashwini Vaidya
|
Samar Husain
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
2016
pdf
abs
A Proposition Bank of Urdu
Maaz Anwar
|
Riyaz Ahmad Bhat
|
Dipti Sharma
|
Ashwini Vaidya
|
Martha Palmer
|
Tafseer Ahmed Khan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper describes our efforts for the development of a Proposition Bank for Urdu, an Indo-Aryan language. Our primary goal is the labeling of syntactic nodes in the existing Urdu dependency Treebank with specific argument labels. In essence, it involves annotation of predicate argument structures of both simple and complex predicates in the Treebank corpus. We describe the overall process of building the PropBank of Urdu. We discuss various statistics pertaining to the Urdu PropBank and the issues which the annotators encountered while developing the PropBank. We also discuss how these challenges were addressed to successfully expand the PropBank corpus. Reporting the inter-annotator agreement between the two annotators, we show that they share a similar understanding of the annotation guidelines and of the linguistic phenomena present in the language. The present size of this PropBank is around 180,000 tokens, which are double-propbanked by the two annotators for simple predicates. Another 100,000 tokens have been annotated for complex predicates of Urdu.
pdf
abs
Linguistic features for Hindi light verb construction identification
Ashwini Vaidya
|
Sumeet Agarwal
|
Martha Palmer
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Light verb constructions (LVC) in Hindi are highly productive. If we can distinguish a case such as nirnay lenaa ‘decision take; decide’ from an ordinary verb-argument combination kaagaz lenaa ‘paper take; take (a) paper’, it has been shown to aid NLP applications such as parsing (Begum et al., 2011) and machine translation (Pal et al., 2011). In this paper, we propose an LVC identification system using language specific features for Hindi which shows an improvement over previous work (Begum et al., 2011). To build our system, we carry out a linguistic analysis of Hindi LVCs using Hindi Treebank annotations and propose two new features that are aimed at capturing the diversity of Hindi LVCs in the corpus. We find that our model performs robustly across a diverse range of LVCs and our results underscore the importance of semantic features, which is in keeping with the findings for English. Our error analysis also demonstrates that our classifier can be used to further refine LVC annotations in the Hindi Treebank and make them more consistent across the board.
2014
pdf
Adapting Predicate Frames for Urdu PropBanking
Riyaz Ahmad Bhat
|
Naman Jain
|
Ashwini Vaidya
|
Martha Palmer
|
Tafseer Ahmed Khan
|
Dipti Misra Sharma
|
James Babani
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants
pdf
bib
Towards Identifying Hindi/Urdu Noun Templates in Support of a Large-Scale LFG Grammar
Sebastian Sulger
|
Ashwini Vaidya
Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing
pdf
Light verb constructions with ‘do’ and ‘be’ in Hindi: A TAG analysis
Ashwini Vaidya
|
Owen Rambow
|
Martha Palmer
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
2013
pdf
Semantic Roles for Nominal Predicates: Building a Lexical Resource
Ashwini Vaidya
|
Martha Palmer
|
Bhuvana Narasimhan
Proceedings of the 9th Workshop on Multiword Expressions
2012
pdf
abs
Empty Argument Insertion in the Hindi PropBank
Ashwini Vaidya
|
Jinho D. Choi
|
Martha Palmer
|
Bhuvana Narasimhan
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper examines both linguistic behavior and practical implication of empty argument insertion in the Hindi PropBank. The Hindi PropBank is annotated on the Hindi Dependency Treebank, which contains some empty categories but not the empty arguments of verbs. In this paper, we analyze four kinds of empty arguments, *PRO*, *REL*, *GAP*, *pro*, and suggest effective ways of annotating these arguments. Empty arguments such as *PRO* and *REL* can be inserted deterministically; we present linguistically motivated rules that automatically insert these arguments with high accuracy. On the other hand, it is difficult to find deterministic rules to insert *GAP* and *pro*; for these arguments, we introduce a new annotation scheme that concurrently handles both semantic role labeling and empty category insertion, producing fast and high quality annotation. In addition, we present algorithms for finding antecedents of *REL* and *PRO*, and discuss why finding antecedents for some types of *PRO* is difficult.
2011
pdf
Analysis of the Hindi Proposition Bank using Dependency Structure
Ashwini Vaidya
|
Jinho Choi
|
Martha Palmer
|
Bhuvana Narasimhan
Proceedings of the 5th Linguistic Annotation Workshop
2010
pdf
abs
Empty Categories in a Hindi Treebank
Archna Bhatia
|
Rajesh Bhatt
|
Bhuvana Narasimhan
|
Martha Palmer
|
Owen Rambow
|
Dipti Misra Sharma
|
Michael Tepper
|
Ashwini Vaidya
|
Fei Xia
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We are in the process of creating a multi-representational and multi-layered treebank for Hindi/Urdu (Palmer et al., 2009), which has three main layers: dependency structure, predicate-argument structure (PropBank), and phrase structure. This paper discusses an important issue in treebank design which is often neglected: the use of empty categories (ECs). All three levels of representation make use of ECs. We make a high-level distinction between two types of ECs, trace and silent, on the basis of whether they are postulated to mark displacement or not. Each type is further refined into several subtypes based on the underlying linguistic phenomena which the ECs are introduced to handle. This paper discusses the stages at which we add ECs to the Hindi/Urdu treebank and why. We investigate methodically the different types of ECs and their role in our syntactic and semantic representations. We also examine our decisions whether or not to coindex each type of ECs with other elements in the representation.
pdf
PropBank Annotation of Multilingual Light Verb Constructions
Jena D. Hwang
|
Archna Bhatia
|
Claire Bonial
|
Aous Mansouri
|
Ashwini Vaidya
|
Nianwen Xue
|
Martha Palmer
Proceedings of the Fourth Linguistic Annotation Workshop
2008
pdf
Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition
Karthik Gali
|
Harshit Surana
|
Ashwini Vaidya
|
Praneeth Shishtla
|
Dipti Misra Sharma
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages