Jakub Waszczuk

Also published as: Jakub Wasczuk

2022

This paper considers the task of parsing low-resource languages in a scenario where parallel English data and also a limited seed of annotated sentences in the target language are available, as for example in bootstrapping parallel treebanks. We focus on constituency parsing using Role and Reference Grammar (RRG), a theory that has so far been understudied in computational linguistics but that is widely used in typological research, i.e., in particular in the context of low-resource languages. Starting from an existing RRG parser, we propose two strategies for low-resource parsing: first, we extend the parsing model into a cross-lingual parser, exploiting the parallel data in the high-resource language and unsupervised word alignments by providing internal states of the source-language parser to the target-language parser. Second, we adopt self-training, thereby iteratively expanding the training data, starting from the seed, by including the most confident new parses in each round. Both in simulated scenarios and with a real low-resource language (Daakaka), we find substantial and complementary improvements from both self-training and cross-lingual parsing. Moreover, we also experimented with using gloss embeddings in addition to token embeddings in the target language, and this also improves results. Finally, starting from what we have for Daakaka, we also consider parsing a related language (Dalkalaen) where glosses and English translations are available but no annotated trees at all, i.e., a no-resource scenario wrt. syntactic annotations. We start with cross-lingual parser trained on Daakaka with glosses and use self-training to adapt it to Dalkalaen. The results are surprisingly good.

2021

pdf bib
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)
Kilian Evang | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk | Torsten Zesch
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2020

pdf abs
Polish corpus of verbal multiword expressions
Agata Savary | Jakub Waszczuk
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

This paper describes a manually annotated corpus of verbal multi-word expressions in Polish. It is among the 4 biggest datasets in release 1.2 of the PARSEME multiligual corpus. We describe the data sources, as well as the annotation process and its outcomes. We also present interesting phenomena encountered during the annotation task and put forward enhancements for the PARSEME annotation guidelines.

We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.

pdf abs
Supervised Disambiguation of German Verbal Idioms with a BiLSTM Architecture
Rafael Ehren | Timm Lichte | Laura Kallmeyer | Jakub Waszczuk
Proceedings of the Second Workshop on Figurative Language Processing

Supervised disambiguation of verbal idioms (VID) poses special demands on the quality and quantity of the annotated data used for learning and evaluation. In this paper, we present a new VID corpus for German and perform a series of VID disambiguation experiments on it. Our best classifier, based on a neural architecture, yields an error reduction across VIDs of 57% in terms of accuracy compared to a simple majority baseline.

pdf abs
Contemplata, a Free Platform for Constituency Treebank Annotation
Jakub Waszczuk | Ilaine Wang | Jean-Yves Antoine | Anaïs Halftermeyer
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes Contemplata, an annotation platform that offers a generic solution for treebank building as well as treebank enrichment with relations between syntactic nodes. Contemplata is dedicated to the annotation of constituency trees. The framework includes support for syntactic parsers, which provide automatic annotations to be manually revised. The balanced strategy of annotation between automatic parsing and manual revision allows to reduce the annotator workload, which favours data reliability. The paper presents the software architecture of Contemplata, describes its practical use and eventually gives two examples of annotation projects that were conducted on the platform.

pdf abs
Statistical Parsing of Tree Wrapping Grammars
Tatiana Bladier | Jakub Waszczuk | Laura Kallmeyer
Proceedings of the 28th International Conference on Computational Linguistics

We describe an approach to statistical parsing with Tree-Wrapping Grammars (TWG). TWG is a tree-rewriting formalism which includes the tree-combination operations of substitution, sister-adjunction and tree-wrapping substitution. TWGs can be extracted from constituency treebanks and aim at representing long distance dependencies (LDDs) in a linguistically adequate way. We present a parsing algorithm for TWGs based on neural supertagging and A* parsing. We extract a TWG for English from the treebanks for Role and Reference Grammar and discuss first parsing results with this grammar.

pdf
Automatic Extraction of Tree-Wrapping Grammars for Multiple Languages
Tatiana Bladier | Laura Kallmeyer | Rainer Osswald | Jakub Waszczuk
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

pdf abs
A Neural Graph-based Approach to Verbal MWE Identification
Jakub Waszczuk | Rafael Ehren | Regina Stodden | Laura Kallmeyer
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

We propose to tackle the problem of verbal multiword expression (VMWE) identification using a neural graph parsing-based approach. Our solution involves encoding VMWE annotations as labellings of dependency trees and, subsequently, applying a neural network to model the probabilities of different labellings. This strategy can be particularly effective when applied to discontinuous VMWEs and, thanks to dense, pre-trained word vector representations, VMWEs unseen during training. Evaluation of our approach on three PARSEME datasets (German, French, and Polish) shows that it allows to achieve performance on par with the previous state-of-the-art (Al Saied et al., 2018).

2018

pdf abs
TRAVERSAL at PARSEME Shared Task 2018: Identification of Verbal Multiword Expressions Using a Discriminative Tree-Structured Model
Jakub Waszczuk
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes a system submitted to the closed track of the PARSEME shared task (edition 1.1) on automatic identification of verbal multiword expressions (VMWEs). The system represents VMWE identification as a labeling task where one of two labels (MWE or not-MWE) must be predicted for each node in the dependency tree based on local context, including adjacent nodes and their labels. The system relies on multiclass logistic regression to determine the globally optimal labeling of a tree. The system ranked 1st in the general cross-lingual ranking of the closed track systems, according to both official evaluation measures: MWE-based F1 and token-based F1.

2017

pdf abs
Projecting Multiword Expression Resources on a Polish Treebank
Agata Savary | Jakub Waszczuk
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Multiword expressions (MWEs) are linguistic objects containing two or more words and showing idiosyncratic behavior at different levels. Treebanks with annotated MWEs enable studies of such properties, as well as training and evaluation of MWE-aware parsers. However, few treebanks contain full-fledged MWE annotations. We show how this gap can be bridged in Polish by projecting 3 MWE resources on a constituency treebank.

pdf
Multiword Expression-Aware A* TAG Parsing Revisited
Jakub Waszczuk | Agata Savary | Yannick Parmentier
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms

pdf
Temporal@ODIL project: Adapting ISO-TimeML to syntactic treebanks for the temporal annotation of spoken speech
Jean-Yves Antoine | Jakub Wasczuk | Anaïs Lefeuvre-Haftermeyer | Lotfi Abouda | Emmanuel Schang | Agata Savary
Proceedings of the 13th Joint ISO-ACL Workshop on Interoperable Semantic Annotation (ISA-13)

2016

pdf abs
Promoting multiword expressions in A* TAG parsing
Jakub Waszczuk | Agata Savary | Yannick Parmentier
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Multiword expressions (MWEs) are pervasive in natural languages and often have both idiomatic and compositional readings, which leads to high syntactic ambiguity. We show that for some MWE types idiomatic readings are usually the correct ones. We propose a heuristic for an A* parser for Tree Adjoining Grammars which benefits from this knowledge by promoting MWE-oriented analyses. This strategy leads to a substantial reduction in the parsing search space in case of true positive MWE occurrences, while avoiding parsing failures in case of false positives.

2012

pdf
Harnessing the CRF Complexity with Domain-Specific Constraints. The Case of Morphosyntactic Tagging of a Highly Inflected Language
Jakub Waszczuk
Proceedings of COLING 2012

2010

pdf abs
Towards the Annotation of Named Entities in the National Corpus of Polish
Agata Savary | Jakub Waszczuk | Adam Przepiórkowski
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present the named entity annotation task within the on-going project of the National Corpus of Polish. To the best of our knowledge, this is the first attempt at a large-scale corpus annotation of Polish named entities. We describe the scope and the TEI-inspired hierarchy of named entities admitted for this task, as well as the TEI-conformant multi-level stand-off annotation format. We also discuss some methodological strategies including the annotation of embedded, coordinated and discontinuous names. Our annotation platform consists of two main tools interconnected by converting facilities. A rule-based natural language processing platform SProUT is used for the automatic pre-annotation of named entities, due to the previously created Polish extraction grammars adapted to the annotation task. A customizable graphical tree editor TrEd, extended to our needs, provides an ergonomic environment for manual correction of annotations. Despite some difficult cases encountered in the early annotation phase, about 2,600 named entities in 1,800 corpus sentences have presently been annotated, which allowed to validate the project methodology and tools.

Co-authors

Venues

mwe4
coling4
lrec2
law1
figlang1
show all...

bsnlp1

tag1

isa1

konvens1

tlt1