Josef Steinberger


2022

Czech Dataset for Cross-lingual Subjectivity Classification
Pavel Přibáň | Josef Steinberger
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions. Our prime motivation is to provide a reliable dataset that can be used with the existing English dataset as a benchmark to test the ability of pre-trained multilingual models to transfer knowledge between Czech and English and vice versa. Two annotators annotated the dataset, reaching a Cohen’s κ inter-annotator agreement of 0.83. To the best of our knowledge, this is the first subjectivity dataset for the Czech language. We also created an additional dataset that consists of 200k automatically labeled sentences. Both datasets are freely available for research purposes. Furthermore, we fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset and achieve an accuracy of 93.56%. We also fine-tune models on the existing English dataset, obtaining results that are on par with the current state of the art. Finally, we perform zero-shot cross-lingual subjectivity classification between Czech and English to verify the usability of our dataset as a cross-lingual benchmark. We compare and discuss the cross-lingual and monolingual results and the ability of multilingual models to transfer knowledge between languages.

2021

Are the Multilingual Models Better? Improving Czech Sentiment with Transformers
Pavel Přibáň | Josef Steinberger
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we aim at improving Czech sentiment with transformer-based models and their multilingual versions. More concretely, we study the task of polarity detection for the Czech language on three sentiment polarity datasets. We fine-tune and perform experiments with five multilingual and three monolingual models. We compare the performance of the monolingual and multilingual models, including a comparison with the older approach based on recurrent neural networks. Furthermore, we test the multilingual models and their ability to transfer knowledge from English to Czech (and vice versa) with zero-shot cross-lingual classification. Our experiments show that the huge multilingual models can outperform the monolingual models. They are also able to detect polarity in another language without any training data, with performance no more than 4.4% worse than state-of-the-art monolingually trained models. Moreover, we achieved new state-of-the-art results on all three datasets.

Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych | Olga Kanishcheva | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Vasyl Starko | Josef Steinberger | Roman Yangarber | Michał Marcińczuk | Senja Pollak | Pavel Přibáň | Marko Robnik-Šikonja
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Bogdan Babych | Zara Kancheva | Olga Kanishcheva | Maria Lebedeva | Michał Marcińczuk | Preslav Nakov | Petya Osenova | Lidia Pivovarova | Senja Pollak | Pavel Přibáň | Ivaylo Radev | Marko Robnik-Šikonja | Vasyl Starko | Josef Steinberger | Roman Yangarber
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

2019

Machine Learning Approach to Fact-Checking in West Slavic Languages
Pavel Přibáň | Tomáš Hercig | Josef Steinberger
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Fake news detection and closely related fact-checking have recently attracted a lot of attention. Automation of these tasks has already been studied for English. For other languages, only a few studies can be found (e.g., Baly et al., 2018), and to the best of our knowledge, no research has been conducted for West Slavic languages. In this paper, we present datasets for Czech, Polish, and Slovak. We also ran initial experiments which set a baseline for further research into this area.

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Michał Marcińczuk | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Laska Laskova | Michał Marcińczuk | Lidia Pivovarova | Pavel Přibáň | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.

2018

UWB at SemEval-2018 Task 10: Capturing Discriminative Attributes from Word Distributions
Tomáš Brychcín | Tomáš Hercig | Josef Steinberger | Michal Konkol
Proceedings of the 12th International Workshop on Semantic Evaluation

We present our UWB system for the task of capturing discriminative attributes at SemEval 2018. Given two words and an attribute, the system decides whether this attribute is discriminative between the words or not. Assuming the Distributional Hypothesis, i.e., that a word's meaning is related to its distribution across contexts, we introduce several approaches to compare word contextual information. We experiment with state-of-the-art semantic spaces and with simple co-occurrence statistics. We show that the word distribution in the corpus has potential for detecting discriminative attributes. Our system achieves an F1 score of 72.1% and is ranked #4 among 26 submitted systems.

2017

Cross-lingual Flames Detection in News Discussions
Josef Steinberger | Tomáš Brychcín | Tomáš Hercig | Peter Krejzl
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We introduce Flames Detector, an online system for measuring flames, i.e. strong negative feelings or emotions, insults or other verbal offences, in news commentaries across five languages. It is designed to assist journalists, public institutions or discussion moderators in detecting news topics which evoke wrangles. We propose a machine learning approach to flames detection and calculate an aggregated score for a set of comment threads. The demo application shows the most flaming topics of the current period in several language variants. The search functionality makes it possible to measure flames in any topic specified by a query. The evaluation shows that flame detection in discussions is a difficult task; however, the application can already reveal interesting information about the actual news discussions.

Pyramid-based Summary Evaluation Using Abstract Meaning Representation
Josef Steinberger | Peter Krejzl | Tomáš Brychcín
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We propose a novel metric for evaluating summary content coverage. The evaluation framework follows the Pyramid approach to measure how many summarization content units, considered important by human annotators, are contained in an automatic summary. Our approach automates the evaluation process, which does not need any manual intervention on the evaluated summary side. It compares abstract meaning representations of each content unit mention and each summary sentence. We found that the proposed metric complements the widely used ROUGE metrics well.

Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres
George Giannakopoulos | Elena Lloret | John M. Conroy | Josef Steinberger | Marina Litvak | Peter Rankel | Benoit Favre
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

MultiLing 2017 Overview
George Giannakopoulos | John Conroy | Jeff Kubina | Peter A. Rankel | Elena Lloret | Josef Steinberger | Marina Litvak | Benoit Favre
Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres

In this brief report we present an overview of the MultiLing 2017 effort and workshop, as implemented within EACL 2017. MultiLing is a community-driven initiative that pushes the state-of-the-art in Automatic Summarization by providing data sets and fostering further research and development of summarization systems. This year the scope of the workshop was widened, bringing together researchers that work on summarization across sources, languages and genres. We summarize the main tasks planned and implemented this year, the contributions received, and we also provide insights on next steps.

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and the short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge goes beyond the date of the publication of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.

2016

UWB at SemEval-2016 Task 6: Stance Detection
Peter Krejzl | Josef Steinberger
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

MediaGist: A Cross-lingual Analyser of Aggregated News and Commentaries
Josef Steinberger
Proceedings of ACL-2016 System Demonstrations

The OnForumS corpus from the Shared Task on Online Forum Summarisation at MultiLing 2015
Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio | Josef Steinberger | Jorge Valderrama | Hugo Zaragoza
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present the OnForumS corpus developed for the shared task of the same name on Online Forum Summarisation (OnForumS at MultiLing’15). The corpus consists of a set of news articles with associated readers’ comments from The Guardian (English) and La Repubblica (Italian). It comes with four levels of annotation: argument structure, comment-article linking, sentiment and coreference. The first three were produced through crowdsourcing, whereas the last was produced by an experienced annotator using a mature annotation scheme. Given its annotation breadth, we believe the corpus will prove a useful resource in stimulating and furthering research in the areas of Argumentation Mining, Summarisation, Sentiment, Coreference and the interlinks therein.

2015

Towards Multilingual Event Extraction Evaluation: A Case Study for the Czech Language
Josef Steinberger | Hristo Tanev
Proceedings of the International Conference Recent Advances in Natural Language Processing

MultiLing 2015: Multilingual Summarization of Single and Multi-Documents, On-line Fora, and Call-center Conversations
George Giannakopoulos | Jeff Kubina | John Conroy | Josef Steinberger | Benoit Favre | Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

UWB: Machine Learning Approach to Aspect-Based Sentiment Analysis
Tomáš Brychcín | Michal Konkol | Josef Steinberger
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Aspect-Level Sentiment Analysis in Czech
Josef Steinberger | Tomáš Brychcín | Michal Konkol
Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2013

Sentiment Analysis in Czech Social Media Using Supervised Machine Learning
Ivan Habernal | Tomáš Ptáček | Josef Steinberger
Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Semi-automatic Acquisition of Lexical Resources and Grammars for Event Extraction in Bulgarian and Czech
Hristo Tanev | Josef Steinberger
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Multi-document multilingual summarization corpus preparation, Part 2: Czech, Hebrew and Spanish
Michael Elhadad | Sabino Miranda-Jiménez | Josef Steinberger | George Giannakopoulos
Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization

The UWB Summariser at Multiling-2013
Josef Steinberger
Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization

2012

Machine Translation for Multilingual Summary Content Evaluation
Josef Steinberger | Marco Turchi
Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization

Relevance Ranking for Translated Texts
Marco Turchi | Josef Steinberger | Lucia Specia
Proceedings of the 16th Annual conference of the European Association for Machine Translation

2011

Highly Multilingual Coreference Resolution Exploiting a Mature Entity Repository
Josef Steinberger | Jenya Belyaeva | Jonathan Crawley | Leonida Della-Rocca | Mohamed Ebrahim | Maud Ehrmann | Mijail Kabadjov | Ralf Steinberger | Erik van der Goot
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Multilingual Entity-Centered Sentiment Analysis Evaluated by Parallel Corpora
Josef Steinberger | Polina Lenkova | Mijail Kabadjov | Ralf Steinberger | Erik van der Goot
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Creating Sentiment Dictionaries via Triangulation
Josef Steinberger | Polina Lenkova | Mohamed Ebrahim | Maud Ehrmann | Ali Hurriyetoglu | Mijail Kabadjov | Ralf Steinberger | Hristo Tanev | Vanni Zavarella | Silvia Vázquez
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

2010

Wrapping up a Summary: From Representation to Generation
Josef Steinberger | Marco Turchi | Mijail Kabadjov | Ralf Steinberger | Nello Cristianini
Proceedings of the ACL 2010 Conference Short Papers

2009

Summarizing Opinions in Blog Threads
Alexandra Balahur | Mijail Kabadjov | Josef Steinberger | Ralf Steinberger | Andrés Montoyo
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2005

Improving LSA-based Summarization with Anaphora Resolution
Josef Steinberger | Mijail Kabadjov | Massimo Poesio | Olivia Sanchez-Graillet
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing