Aleksander Wawer

2022

Prepositions Matter in Quantifier Scope Disambiguation
Aleksander Leczkowski | Justyna Grudzińska | Manuel Vargas Guzmán | Aleksander Wawer | Aleksandra Siemieniuk
Proceedings of the 29th International Conference on Computational Linguistics

Although it is widely agreed that world knowledge plays a significant role in quantifier scope disambiguation (QSD), there has been only very limited work on how to integrate this knowledge into a QSD model. This paper contributes to this scarce line of research by incorporating into a machine learning model our knowledge about relations, as conveyed by a manageable closed class of function words: prepositions. For data, we use a scope-disambiguated corpus created by AnderBois, Brasoveanu and Henderson, which is additionally annotated with prepositional senses using Schneider et al.’s Semantic Network of Adposition and Case Supersenses (SNACS) scheme. By applying Manshadi and Allen’s method to the corpus, we were able to inspect the information gain provided by prepositions for the QSD task. Statistical analysis of the performance of the classifiers, trained in scenarios with and without preposition information, supports the claim that prepositional senses have a strong positive impact on the learnability of automatic QSD systems.

2021

ComboNER: A Lightweight All-In-One POS Tagger, Dependency Parser and NER
Aleksander Wawer
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Current natural language processing research is strongly focused on raising accuracy. This progress comes at the cost of very heavy models with hundreds of millions or even billions of parameters. However, simple syntactic tasks such as part-of-speech (POS) tagging, dependency parsing or named entity recognition (NER) do not require the largest models to achieve acceptable results. In line with this assumption, we try to minimize the size of a model that jointly performs all three tasks. We introduce ComboNER: a lightweight tool, orders of magnitude smaller than state-of-the-art transformers. It is based on pre-trained subword embeddings and a recurrent neural network architecture. ComboNER operates on Polish language data. The model has outputs for POS tagging, dependency parsing and NER. Our paper contains some insights from fine-tuning the model and reports its overall results.

2019

SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization
Bogdan Gliwa | Iwona Mochol | Maciej Biesek | Aleksander Wawer
Proceedings of the 2nd Workshop on New Frontiers in Summarization

This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than model-generated summaries of news, in contrast with human evaluators’ judgement. This suggests that the challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogue corpus, manually annotated with abstractive summaries, which can be used by the research community for further studies.

Fact Checking or Psycholinguistics: How to Distinguish Fake and True Claims?
Aleksander Wawer | Grzegorz Wojdyga | Justyna Sarzyńska-Wawer
Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)

The goal of our paper is to compare psycholinguistic text features with fact checking approaches to distinguish lies from true statements. We examine both methods using data from a large ongoing study on deception and deception detection covering a mixture of factual and opinionated topics that polarize public opinion. We conclude that fact checking approaches based on Wikipedia are too limited for this task, as only a few percent of sentences from our study have enough evidence to be supported or refuted. Psycholinguistic features turn out to outperform both fact checking and human baselines, but the accuracy is not high. Overall, it appears that deception detection applicable to less-than-obvious topics is a difficult task and a problem to be solved.

Predicting Sentiment of Polish Language Short Texts
Aleksander Wawer | Julita Sobiczewska
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

The goal of this paper is to use all available Polish language data sets to seek the best possible performance in supervised sentiment analysis of short texts. We use text collections with labelled sentiment such as tweets, movie reviews and a sentiment treebank, in three comparison modes. In the first, we examine the performance of models trained and tested on the same text collection using standard cross-validation (in-domain). In the second, we train models on all available data except the given test collection, which we use for testing (one vs rest cross-domain). In the third, we train a model on one data set and apply it to another one (one vs one cross-domain). We compare a wide range of methods including machine learning on bag-of-words representation, bidirectional recurrent neural networks as well as the most recent pre-trained architectures ELMo and BERT. We formulate conclusions as to cross-domain and in-domain performance of each method. Unsurprisingly, BERT turned out to be a strong performer, especially in the cross-domain setting. What is surprising, however, is the solid performance of the relatively simple multinomial Naive Bayes classifier, which performed as well as BERT on several data sets.

TMLab SRPOL at SemEval-2019 Task 8: Fact Checking in Community Question Answering Forums
Piotr Niewiński | Aleksander Wawer | Maria Pszona | Maria Janicka
Proceedings of the 13th International Workshop on Semantic Evaluation

The article describes our submission to SemEval 2019 Task 8 on Fact-Checking in Community Forums. The systems under discussion participated in Subtask A: deciding whether a question asks for factual information, asks for opinion/advice, or is just socializing. Our primary submission was ranked second among all participants in the official evaluation phase. The article presents our primary solution: Deeply Regularized Residual Neural Network (DRR NN) with Universal Sentence Encoder embeddings. This is followed by a description of two contrastive solutions based on ensemble methods.

2018

Literal, Metphorical or Both? Detecting Metaphoricity in Isolated Adjective-Noun Phrases
Agnieszka Mykowiecka | Malgorzata Marciniak | Aleksander Wawer
Proceedings of the Workshop on Figurative Language Processing

The paper addresses the classification of isolated Polish adjective-noun phrases according to their metaphoricity. We tested neural networks to predict if a phrase has a literal or metaphorical sense or can have both senses depending on usage. The input to the neural network consists of word embeddings, but we also tested the impact of information about the domain of the adjective and about the abstractness of the noun. We applied our solution to English data available on the Internet and compared it to results published in papers. We found that the solution based on word embeddings only can achieve results comparable with complex solutions requiring additional information.

Detecting Figurative Word Occurrences Using Recurrent Neural Networks
Agnieszka Mykowiecka | Aleksander Wawer | Malgorzata Marciniak
Proceedings of the Workshop on Figurative Language Processing

The paper addresses detection of figurative usage of words in English text. The chosen method was to use neural nets fed by pretrained word embeddings. The obtained results show that simple solutions, based on word embeddings only, are comparable to complex solutions that use many sources of information, which are not available for languages less studied than English.

Multi-Module Recurrent Neural Networks with Transfer Learning
Filip Skurniak | Maria Janicka | Aleksander Wawer
Proceedings of the Workshop on Figurative Language Processing

This paper describes multiple solutions designed and tested for the problem of word-level metaphor detection. The proposed systems are all based on variants of recurrent neural network architectures. Specifically, we explore multiple sources of information: pre-trained word embeddings (GloVe), a dictionary of language concreteness and a transfer learning scenario based on the states of an encoder network from a neural machine translation system. One of the architectures is based on combining all three systems: (1) Neural CRF (Conditional Random Fields), trained directly on the metaphor data set; (2) Neural Machine Translation encoder of a transfer learning scenario; (3) a neural network used to predict final labels, trained directly on the metaphor data set. Our results vary between test sets: the standalone Neural CRF is the best one on submission data, while the combined system scores the highest on a test subset randomly selected from training data.

The Linguistic Category Model in Polish (LCM-PL)
Aleksander Wawer | Justyna Sarzyńska
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Detecting Metaphorical Phrases in the Polish Language
Aleksander Wawer | Agnieszka Mykowiecka
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In this paper we describe experiments with automated detection of metaphors in the Polish language. We focus our analysis on noun phrases composed of an adjective and a noun, and distinguish three types of expressions: with literal sense, with metaphorical sense, and expressions both literal and metaphorical (context-dependent). We propose a method of automatically recognizing expression type using word embeddings and neural networks. We evaluate multiple neural network architectures and demonstrate that the method significantly outperforms strong baselines.

Supervised and Unsupervised Word Sense Disambiguation on Word Embedding Vectors of Unambigous Synonyms
Aleksander Wawer | Agnieszka Mykowiecka
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

This paper compares two approaches to word sense disambiguation using word embeddings trained on unambiguous synonyms. The first is an unsupervised method based on computing log probability from sequences of word embedding vectors, taking into account ambiguous word senses and guessing the correct sense from context. The second method is supervised. We use a multilayer neural network model to learn a context-sensitive transformation that maps an input vector of an ambiguous word into an output vector representing its sense. We evaluate both methods on corpora with manual annotations of word senses from the Polish wordnet (plWordnet).

2016

OPFI: A Tool for Opinion Finding in Polish
Aleksander Wawer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The paper contains a description of OPFI: Opinion Finder for the Polish Language, a freely available tool for opinion target extraction. The goal of the tool is opinion finding: the task of identifying tuples composed of a sentiment (positive or negative) and its target (what or whom the sentiment is expressed about). OPFI is not dependent on any particular method of sentiment identification and provides a built-in sentiment dictionary as a convenient option. Technically, it contains implementations of three different modes of opinion tuple generation: one hybrid based on dependency parsing and CRF, the second based on shallow parsing and the third on deep learning, namely a GRU neural network. The paper also contains a description of related language resources: two annotated treebanks and one set of tweets.

2012

Mining Co-Occurrence Matrices for SO-PMI Paradigm Word Candidates
Aleksander Wawer
Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics

2010

Is Sentiment a Property of Synsets? Evaluating Resources for Sentiment Classification using Machine Learning
Aleksander Wawer
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Existing approaches to classifying documents by sentiment include machine learning with features created from n-grams and part of speech. This paper explores a different approach and examines the performance of one selected machine learning algorithm, Support Vector Machines, with features computed using existing lexical resources. Special attention has been paid to fine-tuning the algorithm with respect to the number of features. The immediate purpose of this experiment is to evaluate lexical and sentiment resources in the document-level sentiment classification task. Results described in the paper are also useful to indicate how lexicon design, different language dimensions and semantic categories contribute to document-level sentiment recognition. In a less direct way (through the examination of evaluated resources), the experiment analyzes the adequacy of lexemes, word senses and synsets as different possible layers for ascribing sentiment, or as candidates for sentiment carriers. The proposed approach of machine learning on word category frequencies instead of n-grams and part of speech features can potentially exhibit improvements in domain independence, but this hypothesis has to be verified in future work.