Barbora Hladká

Also published as: Barbora Hladka, B. Hladká


2026

We present a method of attribution source detection and classification in Czech. A plain text (typically, a newspaper article) enters the SouDec system, gets parsed with the external tool UDPipe into Universal-Dependencies style of sentence representation, and then is analyzed for occurrences of attribution signals and sources. The list of attribution signals has been extracted from a corpus of Czech newspaper articles annotated with interlinked attribution signals and sources, and has been complemented with context and syntax information to help distinguish relevant occurrences of the signals. The SouDec system further classifies the attribution sources in one of five classes: anonymous, partially anonymous, unofficial, official non-political and official political, using information from another external tool, a recognizer and classifier of named entities, NameTag 3. While our source detection method gets results comparable to existing systems for other languages, further improvements can be achieved by incorporating fully-fledged automatic coreference resolution into the classification method. In a focused case study, we test a possible usage of SouDeC for distinguishing domain-specific texts of less vs. more reputable origin.

2022

This paper focuses on detection of sources in the Czech articles published on a news server of Czech public radio. In particular, we search for attribution in sentences and we recognize attributed sources and their sentence context (signals). We organized a crowdsourcing annotation task that resulted in a data set of 2,167 stories with manually recognized signals and sources. In addition, the sources were classified into the classes of named and unnamed sources.

2020

The Parliament of the Czech Republic consists of two chambers: the Chamber of Deputies (Lower House) and the Senate (Upper House). In our work, we focus on agenda and documents that relate to the Chamber of Deputies exclusively. We pay particular attention to stenographic protocols that record the Chamber of Deputies’ meetings. Our overall goal is to (1) compile the protocols into a ParlaCLARIN TEI encoded corpus, (2) make this corpus accessible and searchable in the TEITOK web-based platform, (3) annotate the corpus using the modules available in TEITOK, e.g. detect and recognize named entities, and (4) highlight the annotations in TEITOK. In addition, we add two more goals that we consider innovative: (5) update the corpus every time a new stenographic protocol is published online by the Chambers of Deputies and (6) expose the annotations as the linked open data in order to improve the protocols’ interoperability with other existing linked open data. This paper is devoted to the goals (1) and (5).

2018

2017

We summarize the involvement of our CEMI team in the ”NLI Shared Task 2017”, which deals with both textual and speech input data. We submitted the results achieved by using three different system architectures; each of them combines multiple supervised learning models trained on various feature sets. As expected, better results are achieved with the systems that use both the textual data and the spoken responses. Combining the input data of two different modalities led to a rather dramatic improvement in classification performance. Our best performing method is based on a set of feed-forward neural networks whose hidden-layer outputs are combined together using a softmax layer. We achieved a macro-averaged F1 score of 0.9257 on the evaluation (unseen) test set and our team placed first in the main task together with other three teams.
We present a pilot study on parsing non-native texts written by learners of Czech. We performed experiments that have shown that at least high-level syntactic functions, like subject, predicate, and object, can be assigned based on a parser trained on standard native language.

2016

We introduce a new member of the family of Prague dependency treebanks. The Czech Legal Text Treebank 1.0 is a morphologically and syntactically annotated corpus of 1,128 sentences. The treebank contains texts from the legal domain, namely the documents from the Collection of Laws of the Czech Republic. Legal texts differ from other domains in several language phenomena influenced by rather high frequency of very long sentences. A manual annotation of such sentences presents a new challenge. We describe a strategy and tools for this task. The resulting treebank can be explored in various ways. It can be downloaded from the LINDAT/CLARIN repository and viewed locally using the TrEd editor or it can be accessed on-line using the KonText and TreeQuery tools.

2015

2014

2013

2012

We present a new way to get more morphologically and syntactically annotated data. We have developed an annotation editor tailored to school children to involve them in text annotation. Using this editor, they practice morphology and dependency-based syntax in the same way as they normally do at (Czech) schools, without any special training. Their annotation is then automatically transformed into the target annotation schema. The editor is designed to be language independent, however the subsequent transformation is driven by the annotation framework we are heading for. In our case, the object language is Czech and the target annotation scheme corresponds to the Prague Dependency Treebank annotation framework.

2009

2008

2005

2001

2000

1998

1997