2023
PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaite | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions (VMWEs). Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now covers 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split following the PARSEME v.1.2 standard, which places emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
2020
Prague Dependency Treebank - Consolidated 1.0
Jan Hajič | Eduard Bejček | Jaroslava Hlavacova | Marie Mikulová | Milan Straka | Jan Štěpánek | Barbora Štěpánková
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), the purpose of which is, as has always been the case for the family of the Prague Dependency Treebanks, to serve both as training data for various types of NLP tasks and as a resource for linguistically oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (albeit not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, a Czech translation of the Wall Street Journal, transcribed dialogs and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface and deep syntactic annotation. The diversity of the texts and annotations should serve NLP applications well, and the treebank is also an invaluable resource for linguistic research, including comparative studies regarding texts of different genres. The corpus is publicly and freely available.
2018
ForFun 1.0: Prague Database of Forms and Functions – An Invaluable Resource for Linguistic Research
Marie Mikulová | Eduard Bejček
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
The Relation of Form and Function in Linguistic Theory and in a Multilayer Treebank
Eduard Bejček | Eva Hajičová | Marie Mikulová | Jarmila Panevová
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
2016
Distribution of Valency Complements in Czech Complex Predicates: Between Verb and Noun
Václava Kettnerová | Eduard Bejček
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, we focus on Czech complex predicates formed by a light verb and a predicative noun expressed as the direct object. Although Czech ― as an inflectional language encoding syntactic relations via morphological cases ― provides an excellent opportunity to study the distribution of valency complements in the syntactic structure with complex predicates, this distribution has not been described so far. On the basis of a manual analysis of the richly annotated data from the Prague Dependency Treebank, we thus formulate principles governing this distribution. In an automatic experiment, we verify these principles on well-formed syntactic structures from the Prague Dependency Treebank and the Prague Czech-English Dependency Treebank with very satisfactory results: the distribution of 97% of valency complements in the surface structure is governed by the proposed principles. These results corroborate that the surface structure formation of complex predicates is a regular process.
MWEs in Treebanks: From Survey to Guidelines
Victoria Rosén | Koenraad De Smedt | Gyri Smørdal Losnegaard | Eduard Bejček | Agata Savary | Petya Osenova
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
By means of an online survey, we have investigated ways in which various types of multiword expressions are annotated in existing treebanks. The results indicate that there is considerable variation in treatments across treebanks and thereby also, to some extent, across languages and across theoretical frameworks. The comparison is focused on the annotation of light verb constructions and verbal idioms. The survey shows that light verb constructions either get special annotations as such or are treated as ordinary verbs, while VP idioms are handled through different strategies. Based on insights from our investigation, we propose some general guidelines for annotating multiword expressions in treebanks. The recommendations address the following application-based needs: distinguishing MWEs from similar but compositional constructions; searching for distinct types of MWEs in treebanks; awareness of literal and nonliteral meanings; and normalization of the MWE representation. The cross-lingually and cross-theoretically focused survey is intended as an aid to accessing treebanks and an aid for further work on treebank annotation.
Inherently Pronominal Verbs in Czech: Description and Conversion Based on Treebank Annotation
Zdeňka Urešová | Eduard Bejček | Jan Hajič
Proceedings of the 12th Workshop on Multiword Expressions
2014
Automatic Mapping Lexical Resources: A Lexical Unit as the Keystone
Eduard Bejček | Václava Kettnerová | Markéta Lopatková
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents the fully automatic linking of two valency lexicons of Czech verbs: VALLEX and PDT-VALLEX. Despite the same theoretical background adopted by these lexicons and the same linguistic phenomena they focus on, the fully automatic mapping of these resources is not straightforward. We demonstrate that converting these lexicons into a common format represents a relatively easy part of the task, whereas the automatic identification of pairs of corresponding valency frames (representing lexical units of verbs) poses difficulties. The overall achieved precision of 81% can be considered satisfactory. However, the higher the number of lexical units a verb has, the lower the precision of their automatic mapping usually is. Moreover, we show that especially (i) supplementing further information on lexical units and (ii) revealing and reconciling regular discrepancies in their annotations can greatly assist in the automatic merging.
2013
Syntactic Identification of Occurrences of Multiword Expressions in Text using a Lexicon with Dependency Structures
Eduard Bejček | Pavel Straňák | Pavel Pecina
Proceedings of the 9th Workshop on Multiword Expressions
2012
Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0
Eduard Bejček | Jarmila Panevová | Jan Popelka | Pavel Straňák | Magda Ševčíková | Jan Štěpánek | Zdeněk Žabokrtský
Proceedings of COLING 2012
2008
Annotation of Multiword Expressions in the Prague Dependency Treebank
Eduard Bejček | Pavel Straňák | Pavel Schlesinger
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II