Piotr Pęzik


2020

pdf bib
The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the 12th Language Resources and Evaluation Conference

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represents a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

2018

pdf bib
Increasing the Accessibility of Time-Aligned Speech Corpora with Spokes Mix
Piotr Pęzik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2014

pdf bib
The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

2012

pdf bib
Towards a comprehensive open repository of Polish language resources
Maciej Ogrodniczuk | Piotr Pęzik | Adam Przepiórkowski
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The aim of this paper is to present current efforts towards the creation of a comprehensive open repository of Polish language resources and tools (LRTs). The work described here is carried out within the CESAR project, member of the META-NET consortium. It has already resulted in the creation of the Computational Linguistics in Poland site containing an exhaustive collection of Polish LRTs. Current work is focused on the creation of new LRTs and, esp., the enhancement of existing LRTs, such as parallel corpora, annotated corpora of written and spoken Polish and morphological dictionaries to be made available via the META-SHARE repository.

2010

pdf bib
Recent Developments in the National Corpus of Polish
Adam Przepiórkowski | Rafał L. Górski | Marek Łaziński | Piotr Pęzik
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The aim of the paper is to present recent ― as of March 2010 ― developments in the construction of the National Corpus of Polish (NKJP). The NKJP project was launched at the very end of 2007 and it is aimed at compiling a large, linguistically annotated corpus of contemporary Polish by the end of 2010. Out of the total pool of 1 billion words of text data collected in the project, a 300 million word balanced corpus will be selected to match a set of predefined representativeness criteria. This present paper outlines a number of recent developments in the NKJP project, including: 1) the design of text encoding XML schemata for various levels of linguistic information, 2) a new tool for manual annotation at various levels, 3) numerous improvements in search tools. As the work on NKJP progresses, it becomes clear that this project serves as an important testbed for linguistic annotation and interoperability standards. We believe that our recent experiences will prove relevant to future large-scale language resource compilation efforts.