Pavel Straňák


Compiling Czech Parliamentary Stenographic Protocols into a Corpus
Barbora Hladka | Matyáš Kopp | Pavel Straňák
Proceedings of the Second ParlaCLARIN Workshop

The Parliament of the Czech Republic consists of two chambers: the Chamber of Deputies (Lower House) and the Senate (Upper House). In our work, we focus exclusively on the agenda and documents that relate to the Chamber of Deputies. We pay particular attention to stenographic protocols that record the Chamber of Deputies’ meetings. Our overall goal is to (1) compile the protocols into a ParlaCLARIN TEI encoded corpus, (2) make this corpus accessible and searchable in the TEITOK web-based platform, (3) annotate the corpus using the modules available in TEITOK, e.g. detect and recognize named entities, and (4) highlight the annotations in TEITOK. In addition, we add two more goals that we consider innovative: (5) update the corpus every time a new stenographic protocol is published online by the Chamber of Deputies and (6) expose the annotations as linked open data in order to improve the protocols’ interoperability with other existing linked open data. This paper is devoted to goals (1) and (5).


Bridging the LAPPS Grid and CLARIN
Erhard Hinrichs | Nancy Ide | James Pustejovsky | Jan Hajič | Marie Hinrichs | Mohammad Fazleh Elahi | Keith Suderman | Marc Verhagen | Kyeongmin Rim | Pavel Straňák | Jozef Mišutka
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Diacritics Restoration Using Neural Networks
Jakub Náplava | Milan Straka | Pavel Straňák | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)


The Public License Selector: Making Open Licensing Easier
Pawel Kamocki | Pavel Straňák | Michal Sedlák
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Researchers in Natural Language Processing rely on the availability of data and software, ideally under open licenses, but little is done to actively encourage it. In fact, the current copyright framework grants authors exclusive rights to copy their works, make them available to the public and make derivative works (such as annotated language corpora). Moreover, in the EU, databases are protected against unauthorized extraction and re-utilization of their contents. Therefore, proper public licensing plays a crucial role in providing access to research data. A public license is a license that grants certain rights not to one particular user, but to the general public (everybody). Our article presents a tool that we developed to assist the user in the licensing process. As software and data should be licensed under different licenses, the tool is composed of two separate parts: Data and Software. The underlying logic as well as elements of the graphic interface are presented below.

Improving corpus search via parsing
Natalia Klyueva | Pavel Straňák
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we describe an addition to the corpus query system Kontext that makes it possible to search by syntactic attributes in addition to the existing features, mainly lemmas and morphological categories. We present the enhancements of the corpus query system itself, the attributes we use to represent syntactic structures in data, and some examples of querying the syntactically annotated corpora, such as treebanks in various languages as well as an automatically parsed large corpus.


HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation
Ondřej Bojar | Vojtěch Diatka | Pavel Rychlý | Pavel Straňák | Vít Suchomel | Aleš Tamchyna | Daniel Zeman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, both in release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.


Syntactic Identification of Occurrences of Multiword Expressions in Text using a Lexicon with Dependency Structures
Eduard Bejček | Pavel Straňák | Pavel Pecina
Proceedings of the 9th Workshop on Multiword Expressions


Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0
Eduard Bejček | Jarmila Panevová | Jan Popelka | Pavel Straňák | Magda Ševčíková | Jan Štěpánek | Zdeněk Žabokrtský
Proceedings of COLING 2012

Korektor – A System for Contextual Spell-Checking and Diacritics Completion
Michal Richter | Pavel Straňák | Alexandr Rosen
Proceedings of COLING 2012: Posters


Data Issues in English-to-Hindi Machine Translation
Ondřej Bojar | Pavel Straňák | Daniel Zeman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Statistical machine translation to morphologically richer languages is a challenging task and more so if the source and target languages differ in word order. Current state-of-the-art MT systems thus deliver mediocre results. Adding more parallel data often helps improve the results; if it doesn't, it may be caused by various problems such as different domains, bad alignment or noise in the new data. In this paper we evaluate the English-to-Hindi MT task from this data perspective. We discuss several available parallel data sources and provide cross-evaluation results on their combinations using two freely available statistical MT systems. We demonstrate various problems encountered in the data and describe automatic methods of data cleaning and normalization. We also show that the contents of two independently distributed data sets can unexpectedly overlap, which negatively affects translation quality. Together with the error analysis, we also present a new tool for viewing aligned corpora, which makes it easier to detect difficult parts in the data even for a developer not speaking the target language.


The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages
Jan Hajič | Massimiliano Ciaramita | Richard Johansson | Daisuke Kawahara | Maria Antònia Martí | Lluís Màrquez | Adam Meyers | Joakim Nivre | Sebastian Padó | Jan Štěpánek | Pavel Straňák | Mihai Surdeanu | Nianwen Xue | Yi Zhang
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task


Annotation of Multiword Expressions in the Prague Dependency Treebank
Eduard Bejček | Pavel Straňák | Pavel Schlesinger
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II