Vít Baisa


Benchmark Dataset for Propaganda Detection in Czech Newspaper Texts
Vít Baisa | Ondřej Herman | Aleš Horák
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Propaganda by various pressure groups, ranging from big economies to ideological blocs, is often presented in the form of objective newspaper texts. However, the real objectivity is obscured by imbalanced views and distorted attitudes conveyed through various manipulative stylistic techniques. In the project Manipulative Propaganda Techniques in the Age of Internet, a new resource for automatic analysis of stylistic mechanisms for influencing the readers’ opinion is being developed. In its current version, the resource consists of 7,494 newspaper articles from four selected Czech digital news servers, annotated for the presence of specific manipulative techniques. In this paper, we present the current state of the annotations and describe the structure of the dataset in detail. We also offer an evaluation of bag-of-words classification algorithms for the annotated manipulative techniques.
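
As an illustration of what a bag-of-words classifier for one annotated technique might look like, here is a minimal sketch. It assumes a multinomial Naive Bayes model with add-one smoothing, which is only one possible bag-of-words algorithm and not necessarily one evaluated in the paper. The class name NaiveBayesBoW, the label names, and the English toy sentences are all invented for this example and do not come from the dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real system would use a proper tokenizer."""
    return text.lower().split()


class NaiveBayesBoW:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data; "blaming" stands in for one annotated technique.
train_texts = [
    "they are to blame for everything",
    "the enemy is to blame again",
    "the council approved the budget",
    "the budget vote passed on monday",
]
train_labels = ["blaming", "blaming", "neutral", "neutral"]

model = NaiveBayesBoW().fit(train_texts, train_labels)
prediction = model.predict("who is to blame")
print(prediction)
```

Word order is discarded entirely here, which is the defining property of a bag-of-words representation: only the counts of individual tokens per class influence the decision.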


DSL Shared Task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation–Maximization and Chunk-based Language Model
Ondřej Herman | Vít Suchomel | Vít Baisa | Pavel Rychlý
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we investigate two approaches to the discrimination of similar languages: an expectation–maximization algorithm for estimating the conditional probability P(word|language), and byte-level language models similar to compression-based language modelling methods. These methods reached accuracies of 86.6% and 88.3%, respectively, on set A of the DSL Shared Task 2016 competition.
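
To make the idea of a byte-level language model for language discrimination concrete, the following is a minimal sketch, not the chunk-based model from the paper. It assumes a plain byte bigram model with add-one smoothing over the 256 possible byte values. The class ByteBigramLM, the function identify, and the two toy training sentences are invented for illustration.

```python
import math
from collections import Counter


class ByteBigramLM:
    """Byte-level bigram model with add-one smoothing over the 256 byte values."""

    def __init__(self, training_text):
        data = training_text.encode("utf-8")
        self.bigrams = Counter(zip(data, data[1:]))  # (previous byte, byte) -> count
        self.contexts = Counter(data[:-1])           # previous byte -> count

    def log_prob(self, text):
        """Sum of smoothed log P(byte | previous byte) over the text."""
        data = text.encode("utf-8")
        total = 0.0
        for prev, cur in zip(data, data[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + 256
            total += math.log(numerator / denominator)
        return total


def identify(text, models):
    """Return the language whose model assigns the text the highest log-probability."""
    return max(models, key=lambda lang: models[lang].log_prob(text))


# Invented one-sentence "corpora"; real models would be trained on much more text.
models = {
    "cs": ByteBigramLM("dnes je velmi hezký den a svítí slunce"),
    "sk": ByteBigramLM("dnes je veľmi pekný deň a svieti slnko"),
}
guess = identify("pekný deň", models)
print(guess)
```

Working on bytes rather than words sidesteps tokenization and lets the model pick up on orthographic cues such as diacritics, which is also what makes compression-based approaches attractive for closely related languages.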

VPS-GradeUp: Graded Decisions on Usage Patterns
Vít Baisa | Silvie Cinková | Ema Krejčová | Anna Vernerová
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present VPS-GradeUp, a set of 11,400 graded human decisions on usage patterns of 29 English lexical verbs from the Pattern Dictionary of English Verbs by Patrick Hanks. The annotation contains, for each verb lemma, a batch of 50 concordances with the given lemma as KWIC, and for each of these concordances we provide a graded human decision on how well the individual PDEV patterns for this particular lemma illustrate the given concordance, indicated on a 7-point Likert scale for each PDEV pattern. With our annotation, we were pursuing a pilot investigation of the foundations of human clustering and disambiguation decisions with respect to usage patterns of verbs in context. The data set is publicly available at http://hdl.handle.net/11234/1-1585.

Graded and Word-Sense-Disambiguation Decisions in Corpus Pattern Analysis: a Pilot Study
Silvie Cinková | Ema Krejčová | Anna Vernerová | Vít Baisa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a pilot analysis of a new linguistic resource, VPS-GradeUp (available at http://hdl.handle.net/11234/1-1585). The resource contains 11,400 graded human decisions on usage patterns of 29 English lexical verbs, randomly selected from the Pattern Dictionary of English Verbs (Hanks, 2000–2014) based on their frequency and the number of senses their lemmas have in PDEV. This data set has been created to observe the interannotator agreement on PDEV patterns produced using the Corpus Pattern Analysis (Hanks, 2013). Apart from the graded decisions, the data set also contains traditional Word-Sense-Disambiguation (WSD) labels. We analyze the associations between the graded annotation and WSD annotation. The results of the respective annotations do not correlate with the size of the usage pattern inventory for the respective verb lemmas, which makes the data set worth further linguistic analysis.

European Union Language Resources in Sketch Engine
Vít Baisa | Jan Michelfeit | Marek Medveď | Miloš Jakubíček
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Several parallel corpora built from European Union language resources are presented here. They were processed by state-of-the-art tools and made available for researchers in the corpus manager Sketch Engine. A completely new resource is introduced: the EUR-Lex Corpus, one of the largest parallel corpora available at the moment. It contains 840 million English tokens, and its largest language pair, English-French, has more than 25 million aligned segments (paragraphs).


Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa | Aleš Horák | Marek Medveď
Proceedings of the Workshop Natural Language Processing for Translation Memories

SemEval-2015 Task 15: A CPA dictionary-entry-building task
Vít Baisa | Jane Bradbury | Silvie Cinková | Ismaïl El Maarouf | Adam Kilgarriff | Octavian Popescu
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)


Disambiguating Verbs by Collocation: Corpus Lexicography meets Natural Language Processing
Ismail El Maarouf | Jane Bradbury | Vít Baisa | Patrick Hanks
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper reports the results of Natural Language Processing (NLP) experiments in semantic parsing, based on a new semantic resource, the Pattern Dictionary of English Verbs (PDEV) (Hanks, 2013). This work is set in the DVC (Disambiguating Verbs by Collocation) project, a project in Corpus Lexicography aimed at expanding PDEV to a large scale. This project springs from a long-term collaboration of lexicographers with computer scientists which has given rise to the design and maintenance of specific, adapted, and user-friendly editing and exploration tools. Particular attention is drawn to the use of NLP deep semantic methods to help in data processing. Possible contributions of NLP include pattern disambiguation, the focus of this article. The present article explains how PDEV differs from other lexical resources and describes its structure in detail. It also presents new classification experiments on a subset of 25 verbs. The SVM model obtained a micro-average F1 score of 0.81.
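
For readers unfamiliar with micro-averaging, the sketch below shows one common way to compute a micro-averaged F1 score by pooling all decisions before computing precision and recall. The abstract does not specify the exact evaluation protocol, so the treatment of abstentions (None) and the toy pattern numbers are assumptions made for this example.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over pooled instances.

    gold and predicted are parallel lists of pattern labels;
    a prediction of None means the classifier abstained.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == g:
            tp += 1
        else:
            fn += 1              # the gold pattern was missed
            if p is not None:
                fp += 1          # a wrong pattern was proposed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented pattern numbers for six concordance lines, pooled across verbs.
gold = [1, 1, 2, 3, 2, 1]
predicted = [1, 2, 2, None, 2, 1]
score = micro_f1(gold, predicted)
print(round(score, 3))
```

Because the counts are pooled across all verbs, frequent verbs and frequent patterns weigh more heavily in the micro-average than they would in a macro-average over per-verb scores.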

Extrinsic Corpus Evaluation with a Collocation Dictionary Task
Adam Kilgarriff | Pavel Rychlý | Miloš Jakubíček | Vojtěch Kovář | Vít Baisa | Lucia Kocincová
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The NLP researcher or application-builder often wonders “what corpus should I use, or should I build one of my own? If I build one of my own, how will I know if I have done a good job?” Currently there is very little help available for them. They are in need of a framework for evaluating corpora. We develop such a framework, in relation to corpora which aim for good coverage of ‘general language’. The task we set is automatic creation of a publication-quality collocations dictionary. For a sample of 100 headwords of Czech and 100 of English, we identify a gold standard dataset of (ideally) all the collocations that should appear for these headwords in such a dictionary. The datasets are being made available alongside this paper. We then use them to determine precision and recall for a range of corpora, with a range of parameters.
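
The evaluation described above can be sketched as a set comparison between the collocations extracted from a corpus and the gold standard for a headword. This is a minimal sketch under the assumption that collocations are compared as exact pairs; the function name precision_recall, the headword, and all collocation pairs are invented for illustration, and the top-N cutoff stands in for one of the parameters that might be varied.

```python
def precision_recall(candidates, gold):
    """Precision and recall of candidate collocations against a gold-standard set."""
    candidates, gold = set(candidates), set(gold)
    hits = candidates & gold
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Invented gold standard for the headword "tea".
gold = {("strong", "tea"), ("green", "tea"), ("cup", "tea"), ("tea", "bag")}

# Invented candidates extracted from some corpus, ranked by an association score.
ranked = [
    ("strong", "tea"),
    ("hot", "tea"),
    ("green", "tea"),
    ("nice", "tea"),
    ("tea", "bag"),
]

# Varying the cutoff shows the usual trade-off between precision and recall.
for top_n in (2, 5):
    precision, recall = precision_recall(ranked[:top_n], gold)
    print(top_n, precision, recall)
```

Repeating this over all headwords and over several corpora yields comparable scores, so that the corpora themselves, rather than the extraction method, become the object of evaluation.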


Automatic classification of semantic patterns from the Pattern Dictionary of English Verbs
Ismaïl El Maarouf | Vít Baisa
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora