Adam Przepiórkowski

2024

An Argument for Symmetric Coordination from Dependency Length Minimization: A Replication Study
Adam Przepiórkowski | Magdalena Borysiak | Adam Głowacki
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

It is well known that left conjuncts tend to be shorter in English coordinate structures. On the basis of the Penn Treebank, Przepiórkowski and Woźniak 2023 (in the ACL 2023 proceedings) show that this tendency depends on the difference between the lengths of conjuncts: the larger the difference, the stronger the tendency for the shorter conjunct to occur on the left. However, this dynamic is observed only when the governor of the coordinate structure is on the left of the coordination (e.g., “Bring apples and oranges!”) or when it is absent (e.g., “Come and sing!”), and not when it is on the right (e.g., “Apples and oranges fell”). Given the principle of Dependency Length Minimization, this turns out to provide an argument for the symmetric structure of coordination. We replicate and sharpen this result on the basis of a much larger dataset: parts of the COCA corpus parsed with Stanza. We also investigate the dependence of this result on the assumed unit of length (word vs. character) and on genre.

2023

Conjunct Lengths in English, Dependency Length Minimization, and Dependency Structure of Coordination
Adam Przepiórkowski | Michał Woźniak
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper confirms that, in English binary coordinations, left conjuncts tend to be shorter than right conjuncts, regardless of the position of the governor of the coordination. We demonstrate that this tendency becomes stronger when length differences are greater, but only when the governor is on the left or absent, not when it is on the right. We explain this effect via Dependency Length Minimization and we show that this explanation provides support for symmetrical dependency structures of coordination (where coordination is multi-headed by all conjuncts, as in Word Grammar or in enhanced Universal Dependencies, or where it is single-headed by the conjunction, as in the Prague Dependency Treebank), as opposed to asymmetrical structures (where coordination is headed by the first conjunct, as in the Meaning–Text Theory or in basic Universal Dependencies).
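The Dependency Length Minimization reasoning in this abstract can be illustrated with a toy calculation. The sketch below is not the authors' code or measure: it assumes that each conjunct is headed by its first word, that lengths are counted in words, and that only the arcs linking the governor, the conjunction and the two conjunct heads matter (conjunct-internal arcs are treated as constant). The function names and the two scheme labels are hypothetical.

```python
def positions(n_left, n_right, governor):
    """Word positions of (governor, head of conjunct 1, conjunction, head of conjunct 2).

    governor is "left", "right" or "absent". Each conjunct head is assumed
    to be the first word of its conjunct (a simplifying assumption).
    """
    offset = 1 if governor == "left" else 0
    head1 = offset
    conj = offset + n_left
    head2 = conj + 1
    if governor == "left":
        gov = 0
    elif governor == "right":
        gov = head2 + n_right
    else:
        gov = None
    return gov, head1, conj, head2


def total_length(n_left, n_right, governor, scheme):
    """Sum of arc lengths for one ordering of the conjuncts under one scheme."""
    gov, head1, conj, head2 = positions(n_left, n_right, governor)
    if scheme == "first_conjunct":      # asymmetric: headed by the first conjunct
        arcs = [(head1, head2)]
        attach = head1
    elif scheme == "conjunction":       # symmetric: headed by the conjunction
        arcs = [(conj, head1), (conj, head2)]
        attach = conj
    else:
        raise ValueError(scheme)
    if gov is not None:
        arcs.append((gov, attach))
    return sum(abs(a - b) for a, b in arcs)


def preference(n_short, n_long, governor, scheme):
    """Which ordering of a short and a long conjunct gives the smaller total."""
    short_first = total_length(n_short, n_long, governor, scheme)
    long_first = total_length(n_long, n_short, governor, scheme)
    if short_first < long_first:
        return "short-first"
    if short_first > long_first:
        return "long-first"
    return "none"
```

Under these assumptions, the conjunction-headed scheme prefers the shorter conjunct on the left when the governor is on the left or absent and is indifferent when the governor is on the right, whereas the first-conjunct-headed scheme prefers the shorter conjunct on the left in all three positions. Where a preference exists, the saving grows with the length difference between the conjuncts.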

2021

Comparing learnability of two dependency schemes: ‘semantic’ (UD) and ‘syntactic’ (SUD)
Ryszard Tuora | Adam Przepiórkowski | Aleksander Leczkowski
Findings of the Association for Computational Linguistics: EMNLP 2021

This paper contributes to the thread of research on the learnability of different dependency annotation schemes: one (‘semantic’) favouring content words as heads of dependency relations and the other (‘syntactic’) favouring syntactic heads. Several studies have lent support to the idea that choosing syntactic criteria for assigning heads in dependency trees improves the performance of dependency parsers. This may be explained by postulating that syntactic approaches are generally more learnable. In this study, we test this hypothesis by comparing the performance of five parsing systems (both transition- and graph-based) on a selection of 21 treebanks, each in a ‘semantic’ variant, represented by standard UD (Universal Dependencies), and a ‘syntactic’ variant, represented by SUD (Surface-syntactic Universal Dependencies): unlike previously reported experiments, which considered learnability of ‘semantic’ and ‘syntactic’ annotations of particular constructions in vitro, the experiments reported here consider whole annotation schemes in vivo. Additionally, we compare these annotation schemes using a range of quantitative syntactic properties, which may also reflect their learnability. The results of the experiments show that SUD tends to be more learnable than UD, but the advantage of one or the other scheme depends on the parser and the corpus in question.

2019

Coordination of Unlike Grammatical Functions
Agnieszka Patejuk | Adam Przepiórkowski
Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019)

SyntaxFest 2019 Invited talk - Arguments and adjuncts
Adam Przepiórkowski
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

Nested Coordination in Universal Dependencies
Adam Przepiórkowski | Agnieszka Patejuk
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2018

From Lexical Functional Grammar to Enhanced Universal Dependencies
Adam Przepiórkowski | Agnieszka Patejuk
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This is a summary of an invited talk.

Arguments and Adjuncts in Universal Dependencies
Adam Przepiórkowski | Agnieszka Patejuk
Proceedings of the 27th International Conference on Computational Linguistics

The aim of this paper is to argue for a coherent Universal Dependencies approach to the core vs. non-core distinction. We demonstrate inconsistencies in the current version 2 of UD in this respect – mostly resulting from the preservation of the argument–adjunct dichotomy despite the declared avoidance of this distinction – and propose a relatively conservative modification of UD that is free from these problems.

2014

Extended phraseological information in a valence dictionary for NLP applications
Adam Przepiórkowski | Elżbieta Hajnicz | Agnieszka Patejuk | Marcin Woliński
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Semantic Roles in Grammar Engineering
Wojciech Jaworski | Adam Przepiórkowski
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

Walenty: Towards a comprehensive valence dictionary of Polish
Adam Przepiórkowski | Elżbieta Hajnicz | Agnieszka Patejuk | Marcin Woliński | Filip Skwarski | Marek Świdziński
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents Walenty, a comprehensive valence dictionary of Polish, with a number of novel features, as compared to other such dictionaries. The notion of argument is based on the coordination test and takes into consideration the possibility of diverse morphosyntactic realisations. Some aspects of the internal structure of phraseological (idiomatic) arguments are handled explicitly. While the current version of the dictionary concentrates on syntax, it already contains some semantic features, including semantically defined arguments, such as locative, temporal or manner, as well as control and raising, and work on extending it with semantic roles and selectional preferences is in progress. Although Walenty is still being intensively developed, it is already by far the largest Polish valence dictionary, with around 8600 verbal lemmata and almost 39 000 valence schemata. The dictionary is publicly available on the Creative Commons BY SA licence and may be downloaded from http://zil.ipipan.waw.pl/Walenty.

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress and innovation in our field.

Projection-based Annotation of a Polish Dependency Treebank
Alina Wróblewska | Adam Przepiórkowski
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents an approach to the automatic annotation of sentences with dependency structures. The approach builds on the idea of cross-lingual dependency projection. The presented method of acquiring dependency trees involves a weighting factor in the processes of projecting source dependency relations to target sentences and inducing well-formed target dependency trees from sets of projected dependency relations. Using a parallel corpus, source trees are transferred onto equivalent target sentences via an extended set of alignment links. Projected arcs are initially weighted according to the certainty of word alignment links. Then, arc weights are recalculated using a method based on the EM selection algorithm. Maximum spanning trees selected from EM-scored digraphs and labelled with appropriate grammatical functions constitute a target dependency treebank. Extrinsic evaluation shows that parsers trained on such a treebank may perform comparably to parsers trained on a manually developed treebank.
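The projection step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's method: the weight of a projected arc is assumed here to be the product of the certainties of the two alignment links involved, the EM-based rescoring is not reproduced, and the final step picks the best head per token greedily instead of extracting a maximum spanning tree (a greedy choice can produce cycles, which a spanning-tree decoder rules out). The function names and data layout are hypothetical.

```python
from collections import defaultdict


def project_arcs(source_arcs, alignment):
    """Project source dependency arcs onto target tokens.

    source_arcs: iterable of (head, dependent) source token indices.
    alignment: dict mapping a source index to a list of
               (target index, certainty) pairs.
    Returns a dict {(target head, target dependent): weight}.
    """
    weights = defaultdict(float)
    for head, dep in source_arcs:
        for t_head, p_head in alignment.get(head, []):
            for t_dep, p_dep in alignment.get(dep, []):
                if t_head != t_dep:
                    # Assumed weighting: product of the two link certainties.
                    weights[(t_head, t_dep)] += p_head * p_dep
    return dict(weights)


def best_heads(weights):
    """Greedily pick the highest-weighted head for each target dependent."""
    best = {}
    for (head, dep), weight in weights.items():
        if dep not in best or weight > best[dep][1]:
            best[dep] = (head, weight)
    return {dep: head for dep, (head, _) in best.items()}


# Toy example: a three-token source sentence whose token 1 governs tokens 0 and 2.
source_arcs = [(1, 0), (1, 2)]
alignment = {0: [(0, 0.9)], 1: [(1, 0.8)], 2: [(2, 0.7), (1, 0.2)]}
projected = project_arcs(source_arcs, alignment)
heads = best_heads(projected)
```

In the toy example, the uncertain link from source token 2 to target token 1 would yield a self-loop and is skipped, so target token 1 ends up governing target tokens 0 and 2.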

2013

2012

Towards a comprehensive open repository of Polish language resources
Maciej Ogrodniczuk | Piotr Pęzik | Adam Przepiórkowski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, a member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland site containing an exhaustive collection of Polish LRTs. Current work is focused on the creation of new LRTs and, especially, the enhancement of existing LRTs, such as parallel corpora, annotated corpora of written and spoken Polish and morphological dictionaries to be made available via the META-SHARE repository.

Towards an LFG parser for Polish: An exercise in parasitic grammar development
Agnieszka Patejuk | Adam Przepiórkowski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

While it is possible to build a formal grammar manually from scratch or, going to another extreme, to derive it automatically from a treebank, the development of the LFG grammar of Polish presented in this paper is different from both of these methods as it relies on extensive reuse of existing language resources for Polish. LFG grammars minimally provide two levels of representation: constituent structure (c-structure) produced by context-free phrase structure rules and functional structure (f-structure) created by functional descriptions. The c-structure was based on a DCG grammar of Polish, while the f-structure level was mainly inspired by the available HPSG analyses of Polish. The morphosyntactic information needed to create a lexicon may be taken from one of the following resources: a morphological analyser, a treebank or a corpus. Valence information from the dictionary which accompanies the DCG grammar was converted so that subcategorisation is stated in terms of grammatical functions rather than categories; additionally, missing valence frames may be extracted from the treebank. The obtained grammar is evaluated using constructed testsuites (half of which were provided by previous grammars) and the treebank.

PoliMorf: a (not so) new open morphological dictionary for Polish
Marcin Woliński | Marcin Miłkowski | Maciej Ogrodniczuk | Adam Przepiórkowski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents preliminary results of an effort aiming at the creation of a morphological dictionary of Polish, PoliMorf, available under a very liberal BSD-style license. The dictionary is the result of a merger of two existing resources, SGJP and Morfologik, and was prepared within the CESAR/META-NET initiative. The work completed so far includes re-licensing of the two dictionaries and filling the new resource with the morphological data semi-automatically unified from both sources. The merging process is controlled by the collaborative dictionary development web application Kuźnia, also implemented within the project. The tool involves several advanced features such as using SGJP inflectional patterns for form generation, the possibility of attaching dictionary labels and classification schemes to lexemes, dictionary source record and change tracking. Since SGJP and Morfologik are already used in a significant number of Natural Language Processing projects in Poland, we expect PoliMorf to become the Polish morphological dictionary of choice for many years to come.

Machine Learning of Syntactic Attachment from Morphosyntactic and Semantic Co-occurrence Statistics
Szymon Acedański | Adam Slaski | Adam Przepiórkowski
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages

Simultaneous error detection at two levels of syntactic annotation
Adam Przepiórkowski | Michał Lenart
Proceedings of the Sixth Linguistic Annotation Workshop

Harnessing NLP Techniques in the Processes of Multilingual Content Management
Anelia Belogay | Diman Karagyozov | Svetla Koeva | Cristina Vertan | Adam Przepiórkowski | Dan Cristea | Plovios Raxis
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

A Comprehensive Analysis of Constituent Coordination for Grammar Engineering
Agnieszka Patejuk | Adam Przepiórkowski
Proceedings of COLING 2012

2010

Recent Developments in the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Marek Łaziński | Piotr Pęzik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The aim of the paper is to present recent developments, as of March 2010, in the construction of the National Corpus of Polish (NKJP). The NKJP project was launched at the very end of 2007 and it is aimed at compiling a large, linguistically annotated corpus of contemporary Polish by the end of 2010. Out of the total pool of 1 billion words of text data collected in the project, a 300 million word balanced corpus will be selected to match a set of predefined representativeness criteria. The present paper outlines a number of recent developments in the NKJP project, including: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools. As the work on NKJP progresses, it becomes clear that this project serves as an important testbed for linguistic annotation and interoperability standards. We believe that our recent experiences will prove relevant to future large-scale language resource compilation efforts.

The Design of Syntactic Annotation Levels in the National Corpus of Polish
Katarzyna Głowińska | Adam Przepiórkowski
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The paper presents the procedure of syntactic annotation of the National Corpus of Polish. The paper concentrates on the delimitation of syntactic words (analytical forms, reflexive verbs, discontinuous conjunctions, etc.) and syntactic groups, as well as on problems encountered during the annotation process: syntactic group boundaries, multiword entities, abbreviations, discontinuous phrases and syntactic words. It includes the complete tagset for syntactic words and the list of syntactic groups recognized in NKJP. The tagset defines grammatical classes and categories according to morphosyntactic and syntactic criteria only. Syntactic annotation in the National Corpus of Polish is limited to making constituents of combinations of words. Annotation depends on shallow parsing and manual post-editing of the results by annotators. Manual annotation is performed by two independent annotators, with a referee in cases of disagreement. The manually constructed grammar, both for syntactic words and for syntactic groups, is encoded in the shallow parsing system Spejd.

Towards the Annotation of Named Entities in the National Corpus of Polish
Agata Savary | Jakub Waszczuk | Adam Przepiórkowski
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present the named entity annotation task within the on-going project of the National Corpus of Polish. To the best of our knowledge, this is the first attempt at a large-scale corpus annotation of Polish named entities. We describe the scope and the TEI-inspired hierarchy of named entities admitted for this task, as well as the TEI-conformant multi-level stand-off annotation format. We also discuss some methodological strategies including the annotation of embedded, coordinated and discontinuous names. Our annotation platform consists of two main tools interconnected by converting facilities. A rule-based natural language processing platform SProUT is used for the automatic pre-annotation of named entities, thanks to previously created Polish extraction grammars adapted to the annotation task. A customizable graphical tree editor TrEd, extended to our needs, provides an ergonomic environment for manual correction of annotations. Despite some difficult cases encountered in the early annotation phase, about 2,600 named entities in 1,800 corpus sentences have presently been annotated, which has allowed us to validate the project methodology and tools.

Towards the Adequate Evaluation of Morphosyntactic Taggers
Szymon Acedański | Adam Przepiórkowski
Coling 2010: Posters

2009

Stand-off TEI Annotation: the Case of the National Corpus of Polish
Piotr Bański | Adam Przepiórkowski
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Definition Extraction Using a Sequential Combination of Baseline Grammars and Machine Learning Classifiers
Łukasz Degórski | Michał Marcińczuk | Adam Przepiórkowski
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: Partial Parsing, Machine Learning and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives better results than a relatively complicated partial grammar, and much better results than a pure ML approach.

Towards the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Barbara Lewandowska-Tomaszczyk | Marek Łaziński
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents a new corpus project, aiming at building a national corpus of Polish. What makes it different from a typical YACP (Yet Another Corpus Project) is 1) the fact that all four partners in the project have in the past constructed corpora of Polish, sometimes in the spirit of collaboration, at other times in the spirit of competition, 2) the partners bring into the project varying areas of expertise and experience, so a synergy effect is anticipated, 3) the corpus will be built with an eye on specific applications in various fields, including lexicography (the corpus will be the empirical basis of a new large general dictionary of Polish) and natural language processing (a number of NLP tools will be constructed within the project).

♠ Demo: An Open Source Tool for Partial Parsing and Morphosyntactic Disambiguation
Aleksander Buczyński | Adam Przepiórkowski
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The paper presents Spejd, an Open Source Shallow Parsing and Disambiguation Engine. Spejd (abbreviated to ♠) is based on a fully uniform formalism both for constituency partial parsing and for morphosyntactic disambiguation: the same grammar rule may contain structure-building operations, as well as morphosyntactic correction and disambiguation operations. The formalism and the engine are more flexible than either the usual shallow parsing formalisms, which assume disambiguated input, or the usual unification-based formalisms, which couple disambiguation (via unification) with structure building. Current applications of Spejd include rule-based disambiguation, detection of multiword expressions, valence acquisition, and sentiment analysis. The functionality can be further extended by adding external lexical resources. While the examples are based on the set of rules prepared for the parsing of the IPI PAN Corpus of Polish, ♠ is fully language-independent and we hope it will also be useful in the processing of other languages.