Lukasz Augustyniak


2020

Political Advertising Dataset: the use case of the Polish 2020 Presidential Elections
Lukasz Augustyniak | Krzysztof Rajda | Tomasz Kajdanowicz | Michał Bernaczyk
Proceedings of the Fourth Widening Natural Language Processing Workshop

Political campaigns are full of political ads posted by candidates on social media. Political advertisements constitute a basic form of campaigning, subject to various social requirements. We present the first publicly available dataset for detecting specific text chunks and categories of political advertising in the Polish language. It contains 1,705 human-annotated tweets tagged with nine categories, which constitute campaigning under Polish electoral law. We achieved an inter-annotator agreement of 0.65 (Cohen’s kappa). An additional annotator resolved mismatches between the first two annotators, improving the consistency of the annotation process. We used the newly created dataset to train a well-established neural tagger, which achieved an F1 score of 70%. We also outline possible use cases for such datasets and models, together with an initial analysis of the Polish 2020 Presidential Elections on Twitter.
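For reference, the Cohen’s kappa agreement figure mentioned above can be computed with a standard implementation such as scikit-learn’s cohen_kappa_score. The snippet below is only an illustrative sketch; the token-level labels in it are hypothetical examples, not annotations from the dataset.

# Illustrative sketch: Cohen's kappa between two annotators' token-level tags.
# The labels below are hypothetical, not taken from the political advertising dataset.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["O", "B-AD", "I-AD", "O", "B-AD", "O"]
annotator_b = ["O", "B-AD", "O", "O", "B-AD", "O"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")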

WER we are and WER we think we are
Piotr Szymański | Piotr Żelasko | Mikolaj Morzy | Adrian Szymczak | Marzena Żyła-Hoppe | Joanna Banaszczak | Lukasz Augustyniak | Jan Mizgajski | Yishay Carmiel
Findings of the Association for Computational Linguistics: EMNLP 2020

Natural language processing of conversational speech requires the availability of high-quality transcripts. In this paper, we express our skepticism towards recent reports of very low Word Error Rates (WERs) achieved by modern Automatic Speech Recognition (ASR) systems on benchmark datasets. We outline several problems with popular benchmarks and compare three state-of-the-art commercial ASR systems on an internal dataset of real-life spontaneous human conversations and on the HUB’05 public benchmark. We show that the WERs are significantly higher than the best reported results. We formulate a set of guidelines that may aid in the creation of real-life, multi-domain datasets with high-quality annotations for training and testing of robust ASR systems.
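For context on the WER numbers discussed above: WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The sketch below is a minimal, self-contained illustration of the metric, not the evaluation code used in the paper.

# Minimal illustration of Word Error Rate (WER):
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed here via a word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Example: one substitution against a five-word reference gives WER = 0.2.
print(wer("we compare three asr systems", "we compared three asr systems"))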