2023
People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts
Vit Novotny | Kristina Luger | Michal Štefánik | Tereza Vrabcova | Ales Horak
Findings of the Association for Computational Linguistics: ACL 2023
Although pre-trained named entity recognition (NER) models are highly accurate on modern corpora, they underperform on historical texts due to differences in language and OCR errors. In this work, we develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German. We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap an NER-annotated corpus. Using our corpus, we train an NER model that achieves entity-level Precision of 72.81–93.98% with 58.14–81.77% Recall on a manually-annotated test dataset. Furthermore, we show that using a weighted loss function helps to combat class imbalance in token classification tasks. To make it easy for others to reproduce and build upon our work, we publicly release our corpus, models, and experimental code.
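The weighted loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the weights are chosen, so the inverse-frequency scheme below is an assumption, and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical and not taken from the released code.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed scheme: weight each class by total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of per-token negative log-likelihoods.

    probs is a list of dicts mapping each label to its predicted probability,
    labels is the list of gold labels, weights maps each label to its weight.
    """
    weighted_sum = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    weight_total = sum(weights[y] for y in labels)
    return weighted_sum / weight_total


# Toy token sequence: the "O" (outside) tag dominates, as is typical in NER.
labels = ["O"] * 8 + ["PER", "LOC"]
weights = inverse_frequency_weights(labels)

# A model that predicts a uniform distribution over the three tags.
uniform = [{"O": 1 / 3, "PER": 1 / 3, "LOC": 1 / 3} for _ in labels]
loss = weighted_cross_entropy(uniform, labels, weights)
```

With these weights, an error on a rare entity token contributes more to the loss than an error on a frequent "O" token, which is the general idea behind using class weights against imbalance.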
2019
Benchmark Dataset for Propaganda Detection in Czech Newspaper Texts
Vít Baisa | Ondřej Herman | Ales Horak
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Propaganda of various pressure groups, ranging from big economies to ideological blocs, is often presented in the form of objective newspaper texts. However, the real objectivity is obscured by the support of imbalanced views and distorted attitudes by means of various manipulative stylistic techniques. In the project of Manipulative Propaganda Techniques in the Age of Internet, a new resource for automatic analysis of stylistic mechanisms for influencing the readers’ opinion is developed. In its current version, the resource consists of 7,494 newspaper articles from four selected Czech digital news servers annotated for the presence of specific manipulative techniques. In this paper, we present the current state of the annotations and describe the structure of the dataset in detail. We also offer an evaluation of bag-of-words classification algorithms for the annotated manipulative techniques.
2016
DEBVisDic: Instant Wordnet Building
Adam Rambousek | Ales Horak
Proceedings of the 8th Global WordNet Conference (GWC)
The semantic network editor DEBVisDic has been used by different development teams to create more than 20 national wordnets. The editor was recently re-developed as a multi-platform web-based application for general semantic network editing. One of the main advantages, when compared to the previous implementation, lies in the fact that no client-side installation is needed now. Following the successful first phase in building the Open Dutch Wordnet, DEBVisDic was extended with features that allow users to easily create, edit, and share a new (usually national) wordnet without the need for any complicated configuration or advanced technical skills. The DEBVisDic editor provides advanced features for wordnet browsing, editing, and visualization. Apart from the user-friendly web-based application, DEBVisDic also provides an API to integrate the semantic network data into external applications.
2015
Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa | Aleš Horák | Marek Medveď
Proceedings of the Workshop Natural Language Processing for Translation Memories
2012
Similarity Ranking as Attribute for Machine Learning Approach to Authorship Identification
Jan Rygl | Aleš Horák
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In the authorship identification task, examples of short writings of N authors and an anonymous document written by one of these N authors are given. The task is to determine the authorship of the anonymous text. Practically all approaches solve this problem with machine learning methods. The input attributes for the machine learning process are usually formed by stylistic or grammatical properties of individual documents or a defined similarity between a document and an author. In this paper, we present the results of an experiment to extend the machine learning attributes by ranking the similarity between a document and an author: we transform the similarity between an unknown document and one of the N authors to the order in which the author is the most similar to the document in the set of N authors. The comparison of similarity probability and similarity ranking was made using the Support Vector Machines algorithm. The results show that machine learning methods perform slightly better with attributes based on the ranking of similarity than with the previously used similarity between an author and a document.
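The similarity-to-rank transformation described in the abstract can be sketched as follows. This is only an illustration: the function name `similarity_to_rank`, the author identifiers, and the similarity scores are hypothetical, and the tie-breaking rule (alphabetical by author identifier) is an assumption that the abstract does not address.

```python
def similarity_to_rank(similarities):
    """Replace each author's similarity score with its rank (1 = most similar)."""
    # Order authors by descending similarity; ties are broken by author id
    # so that the result is deterministic (an assumption of this sketch).
    ordered = sorted(similarities, key=lambda author: (-similarities[author], author))
    return {author: position for position, author in enumerate(ordered, start=1)}


# Hypothetical similarities between one anonymous document and N = 3 authors.
scores = {"author_a": 0.62, "author_b": 0.91, "author_c": 0.40}
ranks = similarity_to_rank(scores)
```

Here `ranks` maps `author_b` to 1, `author_a` to 2, and `author_c` to 3. The ranks, rather than the raw scores, would then serve as input attributes for a classifier such as the Support Vector Machines used in the paper.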
2007
Verb Valency Semantic Representation for Deep Linguistic Processing
Aleš Horák | Karel Pala | Marie Duží | Pavel Materna
ACL 2007 Workshop on Deep Linguistic Processing
2006
Platform for Full-Syntax Grammar Development Using Meta-grammar Constructs
Aleš Horák | Vladimír Kadlec
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation
2002
Best Analysis Selection in Inflectional Languages
Aleš Horák | Pavel Smrž
COLING 2002: The 19th International Conference on Computational Linguistics
2001
Efficient Sentence Parsing with Language Specific Features: A Case Study of Czech
Aleš Horák | Pavel Smrž
Proceedings of the Seventh International Workshop on Parsing Technologies
2000
Large Scale Parsing of Czech
Pavel Smrž | Aleš Horák
Proceedings of the COLING-2000 Workshop on Efficiency In Large-Scale Parsing Systems