Jan Šnajder

Also published as: Jan Snajder


2021

PANDORA Talks: Personality and Demographics on Reddit
Matej Gjurković | Mladen Karan | Iva Vukojević | Mihaela Bošnjak | Jan Snajder
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Personality and demographics are important variables in social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.

2020

Staying True to Your Word: (How) Can Attention Become Explanation?
Martin Tutek | Jan Snajder
Proceedings of the 5th Workshop on Representation Learning for NLP

The attention mechanism has quickly become ubiquitous in NLP. In addition to improving performance of models, attention has been widely used as a glimpse into the inner workings of NLP models. The latter aspect has in recent years become a common topic of discussion, most notably in the recent work of Jain and Wallace and of Wiegreffe and Pinter. With the shortcomings of using attention weights as a tool of transparency revealed, the attention mechanism has been stuck in limbo, without concrete proof of when and whether it can be used as an explanation. In this paper, we provide an explanation as to why attention has seen rightful critique when used with recurrent networks in sequence classification tasks. We propose a remedy to these issues in the form of a word-level objective, and our findings give credibility for attention to provide faithful interpretations of recurrent models.

Improved Local Citation Recommendation Based on Context Enhanced with Global Information
Zoran Medić | Jan Snajder
Proceedings of the First Workshop on Scholarly Document Processing

Local citation recommendation aims at finding articles relevant for a given citation context. While most previous approaches represent context using solely the text surrounding the citation, we propose enhancing context representation with global information. Specifically, we include the citing article’s title and abstract in the context representation. We evaluate our model on datasets with different citation context sizes and demonstrate improvements with globally-enhanced context representations when citation contexts are smaller.

2019

Preemptive Toxic Language Detection in Wikipedia Comments Using Thread-Level Context
Mladen Karan | Jan Šnajder
Proceedings of the Third Workshop on Abusive Language Online

We address the task of automatically detecting toxic content in user-generated texts. We focus on exploring the potential for preemptive moderation, i.e., predicting whether a particular conversation thread will, in the future, incite a toxic comment. Moreover, we perform a preliminary investigation of whether a model that jointly considers all comments in a conversation thread outperforms a model that considers only individual comments. Using an existing dataset of conversations among Wikipedia contributors as a starting point, we compile a new large-scale dataset for this task consisting of labeled comments and comments from their conversation threads.

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Michał Marcińczuk | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Analysing Rhetorical Structure as a Key Feature of Summary Coherence
Jan Šnajder | Tamara Sladoljev-Agejev | Svjetlana Kolić Vehovec
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present a model for automatic scoring of coherence based on comparing the rhetorical structure (RS) of college student summaries in L2 (English) against expert summaries. Coherence is conceptualised as a construct consisting of the rhetorical relation and its arguments. Comparison with expert-assigned scores shows that RS scores correlate with both cohesion and coherence. Furthermore, RS scores improve the accuracy of a regression model for cohesion score prediction.

Evaluating Automatic Term Extraction Methods on Individual Documents
Antonio Šajatović | Maja Buljan | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

Automatic Term Extraction (ATE) extracts terminology from domain-specific corpora. ATE is used in many NLP tasks, including Computer Assisted Translation, where it is typically applied to individual documents rather than the entire corpus. While corpus-level ATE has been extensively evaluated, it is not obvious how the results transfer to document-level ATE. To fill this gap, we evaluate 16 state-of-the-art ATE methods on full-length documents from three different domains, on both corpus and document levels. Unlike existing studies, our evaluation is more realistic as we take into account all gold terms. We show that no single method is best in corpus-level ATE, but C-Value and KeyConceptRelatedness surpass others in document-level ATE.

TakeLab at SemEval-2019 Task 4: Hyperpartisan News Detection
Niko Palić | Juraj Vladika | Dominik Čubelić | Ivan Lovrenčić | Maja Buljan | Jan Šnajder
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we demonstrate the system built to solve the SemEval-2019 task 4: Hyperpartisan News Detection (Kiesel et al., 2019), the task of automatically determining whether an article is heavily biased towards one side of the political spectrum. Our system receives an article in its raw, textual form, analyzes it, and predicts with moderate accuracy whether the article is hyperpartisan. The learning model used was primarily trained on a manually prelabeled dataset containing news articles. The system relies on the previously constructed SVM model, available in the Python Scikit-Learn library. We ranked 6th in the competition of 42 teams with an accuracy of 79.1% (the winning team had 82.2%).

2018

TakeLab at SemEval-2018 Task 7: Combining Sparse and Dense Features for Relation Classification in Scientific Texts
Martin Gluhak | Maria Pia di Buono | Abbas Akkasi | Jan Šnajder
Proceedings of The 12th International Workshop on Semantic Evaluation

We describe two systems for semantic relation classification with which we participated in the SemEval 2018 Task 7, subtask 1 on semantic relation classification: an SVM model and a CNN model. Both models combine dense pretrained word2vec features and handcrafted sparse features. For training the models, we combine the two datasets provided for the subtasks in order to balance the under-represented classes. The SVM model performed better than CNN, achieving an F1-macro score of 69.98% on subtask 1.1 and 75.69% on subtask 1.2. The system ranked 7th among 28 submissions on subtask 1.1 and 7th among 20 submissions on subtask 1.2.

TakeLab at SemEval-2018 Task12: Argument Reasoning Comprehension with Skip-Thought Vectors
Ana Brassard | Tin Kuculo | Filip Boltužić | Jan Šnajder
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our system for the SemEval-2018 Task 12: Argument Reasoning Comprehension Task. We utilize skip-thought vectors, sentence-level distributional vectors inspired by the popular word embeddings and the skip-gram model. We encode preprocessed sentences from the dataset into vectors, then perform a binary supervised classification of the warrant that justifies the use of the reason as support for the claim. We explore a few variations of the model, reaching 54.1% accuracy on the test set, which placed us 16th out of 22 teams participating in the task.

Lexical Substitution for Evaluating Compositional Distributional Models
Maja Buljan | Sebastian Padó | Jan Šnajder
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Compositional Distributional Semantic Models (CDSMs) model the meaning of phrases and sentences in vector space. They have been predominantly evaluated on limited, artificial tasks such as semantic sentence similarity on hand-constructed datasets. This paper argues for lexical substitution (LexSub) as a means to evaluate CDSMs. LexSub is a more natural task, enables us to evaluate meaning composition at the level of individual words, and provides a common ground to compare CDSMs with dedicated LexSub models. We create a LexSub dataset for CDSM evaluation from a corpus with manual “all-words” LexSub annotation. Our experiments indicate that the Practical Lexical Function CDSM outperforms simple component-wise CDSMs and performs on par with the context2vec LexSub model using the same context.

Reddit: A Gold Mine for Personality Prediction
Matej Gjurković | Jan Šnajder
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Automated personality prediction from social media is gaining increasing attention in natural language processing and social sciences communities. However, due to high labeling costs and privacy issues, the few publicly available datasets are of limited size and low topic diversity. We address this problem by introducing a large-scale dataset derived from Reddit, a source so far overlooked for personality prediction. The dataset is labeled with Myers-Briggs Type Indicators (MBTI) and comes with a rich set of features for more than 9k users. We carry out a preliminary feature analysis, revealing marked differences between the MBTI dimensions and poles. Furthermore, we use the dataset to train and evaluate benchmark personality prediction models, achieving macro F1-scores between 67% and 82% on the individual dimensions and 82% accuracy for exact or one-off accurate type prediction. These results are encouraging and comparable with the reliability of standardized tests.
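
The "exact or one-off" accuracy reported above can be read as counting a predicted four-letter MBTI type as correct when it differs from the gold type in at most one dimension. The following is a minimal sketch under that assumption; the function names and the toy labels are illustrative and are not taken from the paper.

```python
def mbti_mismatches(gold: str, pred: str) -> int:
    """Count the MBTI dimensions (letters) on which two types disagree."""
    if len(gold) != 4 or len(pred) != 4:
        raise ValueError("MBTI types must have exactly four letters")
    return sum(g != p for g, p in zip(gold.upper(), pred.upper()))


def one_off_accuracy(gold_types, pred_types, max_mismatches=1):
    """Share of predictions differing from gold in at most `max_mismatches` letters."""
    pairs = list(zip(gold_types, pred_types))
    if not pairs:
        raise ValueError("need at least one prediction")
    hits = sum(mbti_mismatches(g, p) <= max_mismatches for g, p in pairs)
    return hits / len(pairs)


# Toy labels, invented for illustration only.
gold = ["INTJ", "ENFP", "ISTP", "ESFJ"]
pred = ["INTJ", "ENTP", "ESFP", "INTP"]
print(one_off_accuracy(gold, pred))  # two of the four predictions are within one letter
```

Setting `max_mismatches=0` reduces the measure to plain exact-match accuracy over the 16 types.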

Combining Shallow and Deep Learning for Aggressive Text Detection
Viktor Golem | Mladen Karan | Jan Šnajder
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

We describe the participation of team TakeLab in the aggression detection shared task at the TRAC1 workshop for English. Aggression manifests in a variety of ways. Unlike some forms of aggression that are impossible to prevent in day-to-day life, aggressive speech abounding on social networks could in principle be prevented or at least reduced by simply disabling users that post aggressively worded messages. The first step in achieving this is to detect such messages. The task, however, is far from being trivial, as what is considered as aggressive speech can be quite subjective, and the task is further complicated by the noisy nature of user-generated text on social networks. Our system learns to distinguish between open aggression, covert aggression, and non-aggression in social media texts. We tried different machine learning approaches, including traditional (shallow) machine learning models, deep learning models, and a combination of both. We achieved respectable results, ranking 4th and 8th out of 31 submissions on the Facebook and Twitter test sets, respectively.

Cross-Domain Detection of Abusive Language Online
Mladen Karan | Jan Šnajder
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

We investigate to what extent the models trained to detect general abusive language generalize between different datasets labeled with different abusive language types. To this end, we compare the cross-domain performance of simple classification models on nine different datasets, finding that the models fail to generalize to out-of-domain datasets and that having at least some in-domain data is important. We also show that using the frustratingly simple domain adaptation (Daumé III, 2007) in most cases improves the results over in-domain training, especially when used to augment a smaller dataset with a larger one.
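
The domain adaptation method cited above works by feature augmentation: every feature vector is expanded into a shared copy, a source-specific copy, and a target-specific copy, so that a standard classifier can learn which features behave the same across domains. The sketch below illustrates only that augmentation step; the function name, the domain labels, and the toy vectors are assumptions made for illustration and do not reproduce the paper's pipeline.

```python
def augment(features, domain):
    """Expand a feature vector into [shared, source-only, target-only] blocks.

    Source examples fill the shared and source blocks; target examples fill
    the shared and target blocks. The unused block is all zeros.
    """
    x = [float(v) for v in features]
    zeros = [0.0] * len(x)
    if domain == "source":
        return x + x + zeros
    if domain == "target":
        return x + zeros + x
    raise ValueError("domain must be 'source' or 'target'")


# Toy two-dimensional feature vectors, invented for illustration only.
source_example = augment([1, 0], "source")
target_example = augment([0, 2], "target")
print(source_example)
print(target_example)
```

The augmented vectors from both domains are then pooled and passed to any ordinary supervised learner, which is what makes the approach easy to combine with simple classification models.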

Iterative Recursive Attention Model for Interpretable Sequence Classification
Martin Tutek | Jan Šnajder
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Natural language processing has greatly benefited from the introduction of the attention mechanism. However, standard attention models are of limited interpretability for tasks that involve a series of inference steps. We describe an iterative recursive attention model, which constructs incremental representations of input data through reusing results of previously computed queries. We train our model on sentiment classification datasets and demonstrate its capacity to identify and combine different aspects of the input in an easily interpretable manner, while obtaining performance close to the state of the art.

Not Just Depressed: Bipolar Disorder Prediction on Reddit
Ivan Sekulic | Matej Gjurković | Jan Šnajder
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Bipolar disorder, an illness characterized by manic and depressive episodes, affects more than 60 million people worldwide. We present a preliminary study on bipolar disorder prediction from user-generated text on Reddit, which relies on users’ self-reported labels. Our benchmark classifiers for bipolar disorder prediction outperform the baselines and reach accuracy and F1-scores of above 86%. Feature analysis shows interesting differences in language use between users with bipolar disorders and the control group, including differences in the use of emotion-expressive words.

2017

Using Analytic Scoring Rubrics in the Automatic Assessment of College-Level Summary Writing Tasks in L2
Tamara Sladoljev-Agejev | Jan Šnajder
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Assessing summaries is a demanding, yet useful task which provides valuable information on language competence, especially for second language learners. We consider automated scoring of a college-level summary writing task in English as a second language (EL2). We adopt the Reading-for-Understanding (RU) cognitive framework, extended with the Reading-to-Write (RW) element, and use analytic scoring with six rubrics covering content and writing quality. We show that regression models with reference-based and linguistic features considerably outperform the baselines across all the rubrics. Moreover, we find interesting correlations between summary features and analytic rubrics, revealing the links between the RU and RW constructs.

Does Free Word Order Hurt? Assessing the Practical Lexical Function Model for Croatian
Zoran Medić | Jan Šnajder | Sebastian Padó
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

The Practical Lexical Function (PLF) model is a model of computational distributional semantics that attempts to strike a balance between expressivity and learnability in predicting phrase meaning and shows competitive results. We investigate how well the PLF carries over to free word order languages, given that it builds on observations of predicate-argument combinations that are harder to recover in free word order languages. We evaluate variants of the PLF for Croatian, using a new lexical substitution dataset. We find that the PLF works about as well for Croatian as for English, but demonstrate that its strength lies in modeling verbs, and that the free word order affects the less robust PLF variant.

TakeLab-QA at SemEval-2017 Task 3: Classification Experiments for Answer Retrieval in Community QA
Filip Šaina | Toni Kukurin | Lukrecija Puljić | Mladen Karan | Jan Šnajder
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper we present the TakeLab-QA entry to SemEval 2017 task 3, which is a question-comment re-ranking problem. We present a classification based approach, including two supervised learning models – Support Vector Machines (SVM) and Convolutional Neural Networks (CNN). We use features based on different semantic similarity models (e.g., Latent Dirichlet Allocation), as well as features based on several types of pre-trained word embeddings. Moreover, we also use some hand-crafted task-specific features. For training, our system uses no external labeled data apart from that provided by the organizers. Our primary submission achieves a MAP-score of 81.14 and F1-score of 66.99 – ranking us 10th on the SemEval 2017 task 3, subtask A.

TakeLab at SemEval-2017 Task 6: #RankingHumorIn4Pages
Marin Kukovačec | Juraj Malenica | Ivan Mršić | Antonio Šajatović | Domagoj Alagić | Jan Šnajder
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our system for humor ranking in tweets within the SemEval 2017 Task 6: #HashtagWars (6A and 6B). For both subtasks, we use an off-the-shelf gradient boosting model built on a rich set of features, handcrafted to provide the model with the external knowledge needed to better predict the humor in the text. The features capture various cultural references and specific humor patterns. Our system ranked 2nd (officially 7th) among 10 submissions on the Subtask A and 2nd among 9 submissions on the Subtask B.

TakeLab at SemEval-2017 Task 4: Recent Deaths and the Power of Nostalgia in Sentiment Analysis in Twitter
David Lozić | Doria Šarić | Ivan Tokić | Zoran Medić | Jan Šnajder
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes the system we submitted to SemEval-2017 Task 4 (Sentiment Analysis in Twitter), specifically subtasks A, B, and D. Our main focus was topic-based message polarity classification on a two-point scale (subtask B). The system we submitted uses a Support Vector Machine classifier with a rich set of features, ranging from standard to more creative, task-specific features, including a series of rating-based features as well as features that account for sentimental reminiscence of past topics and deceased famous people. Our system ranked 14th out of 39 submissions in subtask A, 5th out of 24 submissions in subtask B, and 3rd out of 16 submissions in subtask D.

TakeLab at SemEval-2017 Task 5: Linear aggregation of word embeddings for fine-grained sentiment analysis of financial news
Leon Rotim | Martin Tutek | Jan Šnajder
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our system for fine-grained sentiment scoring of news headlines submitted to SemEval 2017 task 5–subtask 2. Our system uses a feature-light method that consists of a Support Vector Regression (SVR) with various kernels and word vectors as features. Our best-performing submission scored 3rd on the task out of 29 teams and 4th out of 45 submissions with a cosine score of 0.733.
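
A cosine score of this kind compares the vector of predicted sentiment scores with the vector of gold scores across all headlines. The sketch below computes plain cosine similarity; it is an assumption that this matches the spirit of the reported measure, and the official task scorer may apply additional weighting. The function name and the toy scores are illustrative.

```python
import math


def cosine_score(gold, pred):
    """Cosine similarity between gold and predicted score vectors."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    dot = sum(g * p for g, p in zip(gold, pred))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    if norm_gold == 0.0 or norm_pred == 0.0:
        # Cosine is undefined for an all-zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_gold * norm_pred)


# Toy sentiment scores in [-1, 1], invented for illustration only.
gold_scores = [0.6, -0.4, 0.1]
pred_scores = [0.5, -0.3, 0.2]
print(cosine_score(gold_scores, pred_scores))
```

Because cosine similarity ignores the overall scale of the vectors, a system is rewarded for getting the direction and relative strength of sentiment right rather than the exact magnitudes.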

Two Layers of Annotation for Representing Event Mentions in News Stories
Maria Pia di Buono | Martin Tutek | Jan Šnajder | Goran Glavaš | Bojana Dalbelo Bašić | Nataša Milić-Frayling
Proceedings of the 11th Linguistic Annotation Workshop

In this paper, we describe our preliminary study on annotating event mentions as a part of our research on high-precision news event extraction models. To this end, we propose a two-layer annotation scheme, designed to separately capture the functional and conceptual aspects of event mentions. We hypothesize that the precision of models can be improved by modeling and extracting separately the different aspects of news events, and then combining the extracted information by leveraging the complementarities of the models. In addition, we carry out a preliminary annotation using the proposed scheme and analyze the annotation quality in terms of inter-annotator agreement.

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Tomaž Erjavec | Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

A Preliminary Study of Croatian Lexical Substitution
Domagoj Alagić | Jan Šnajder
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Lexical substitution is the task of determining a meaning-preserving replacement for a word in context. We report on a preliminary study of this task for the Croatian language on a small-scale lexical sample dataset, manually annotated using three different annotation schemes. We compare the annotations, analyze the inter-annotator agreement, and observe a number of interesting language-specific details in the obtained lexical substitutes. Furthermore, we apply a recently proposed, dependency-based lexical substitution model to our dataset. The model achieves a P@3 score of 0.35, which indicates the difficulty of the task.

Debunking Sentiment Lexicons: A Case of Domain-Specific Sentiment Classification for Croatian
Paula Gombar | Zoran Medić | Domagoj Alagić | Jan Šnajder
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Sentiment lexicons are widely used as an intuitive and inexpensive way of tackling sentiment classification, often within a simple lexicon word-counting approach or as part of a supervised model. However, it is an open question whether these approaches can compete with supervised models that use only word-representation features. We address this question in the context of domain-specific sentiment classification for Croatian. We experiment with the graph-based acquisition of sentiment lexicons, analyze their quality, and investigate how effectively they can be used in sentiment classification. Our results indicate that, even with as few as 500 labeled instances, a supervised model substantially outperforms a word-counting model. We also observe that adding lexicon-based features does not significantly improve supervised sentiment classification.

Comparison of Short-Text Sentiment Analysis Methods for Croatian
Leon Rotim | Jan Šnajder
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

We focus on the task of supervised sentiment classification of short and informal texts in Croatian, using two simple yet effective methods: word embeddings and string kernels. We investigate whether word embeddings offer any advantage over corpus- and preprocessing-free string kernels, and how these compare to bag-of-words baselines. We conduct a comparison on three different datasets, using different preprocessing methods and kernel functions. Results show that, on two out of three datasets, word embeddings outperform string kernels, which in turn outperform word and n-gram bag-of-words baselines.

The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Josef Steinberger | Roman Yangarber
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper describes the outcomes of the first challenge on multilingual named entity recognition that aimed at recognizing mentions of named entities in web documents in Slavic languages, their normalization/lemmatization, and cross-language matching. It was organised in the context of the 6th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2017 conference. Although eleven teams signed up for the evaluation, due to the complexity of the task(s) and short time available for elaborating a solution, only two teams submitted results on time. The reported evaluation figures reflect the relatively higher level of complexity of named entity-related tasks in the context of processing texts in Slavic languages. Since the duration of the challenge goes beyond the date of the publication of this paper, an updated picture of the participating systems and their corresponding performance can be found on the web page of the challenge.

Combining Linguistic Features for the Detection of Croatian Multiword Expressions
Maja Buljan | Jan Šnajder
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

As multiword expressions (MWEs) exhibit a range of idiosyncrasies, their automatic detection warrants the use of many different features. Tsvetkov and Wintner (2014) proposed a Bayesian network model that combines linguistically motivated features and also models their interactions. In this paper, we extend their model with new features and apply it to Croatian, a morphologically complex and a relatively free word order language, achieving a satisfactory performance of 0.823 F1-score. Furthermore, by comparing against (semi)naive Bayes models, we demonstrate that manually modeling feature interactions is indeed important. We make our annotated dataset of Croatian MWEs freely available.

Predicting News Values from Headline Text and Emotions
Maria Pia di Buono | Jan Šnajder | Bojana Dalbelo Bašić | Goran Glavaš | Martin Tutek | Natasa Milic-Frayling
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

We present a preliminary study on predicting news values from headline text and emotions. We perform a multivariate analysis on a dataset manually annotated with news values and emotions, discovering interesting correlations among them. We then train two competitive machine learning models – an SVM and a CNN – to predict news values from headline text and emotions as features. We find that, while both models yield a satisfactory performance, some news values are more difficult to detect than others, while some profit more from including emotion information.

Toward Stance Classification Based on Claim Microstructures
Filip Boltužić | Jan Šnajder
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Claims are the building blocks of arguments and the reasons underpinning opinions, thus analyzing claims is important for both argumentation mining and opinion mining. We propose a framework for representing claims as microstructures, which express the beliefs, judgments, and policies about the relations between domain-specific concepts. In a proof-of-concept study, we manually build microstructures for over 800 claims extracted from an online debate. We test the so-obtained microstructures on the task of claim stance classification, achieving considerable improvements over text-based baselines.

Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice
Julian Brooke | Jan Šnajder | Timothy Baldwin
Transactions of the Association for Computational Linguistics, Volume 5

We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates. In contrast to the standard approach of simple ranking by association measure, in our model n-grams are arranged in a lattice structure based on subsumption and overlap relationships, with nodes inhibiting other nodes in their vicinity when they are selected as a lexical item. We show how the configuration of such a lattice can be optimized tractably, and demonstrate using annotations of sampled n-grams that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages.

2016

Analysis of Policy Agendas: Lessons Learned from Automatic Topic Classification of Croatian Political Texts
Mladen Karan | Jan Šnajder | Daniela Širinić | Goran Glavaš
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Fill the Gap! Analyzing Implicit Premises between Claims from Online Debates
Filip Boltužić | Jan Šnajder
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

Predictability of Distributional Semantics in Derivational Word Formation
Sebastian Padó | Aurélie Herbelot | Max Kisselew | Jan Šnajder
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Compositional distributional semantic models (CDSMs) have successfully been applied to the task of predicting the meaning of a range of linguistic constructions. Their performance on the semi-compositional word formation process of (morphological) derivation, however, has been extremely variable, with no large-scale empirical investigation to date. This paper fills that gap, performing an analysis of CDSM predictions on a large dataset (over 30,000 German derivationally related word pairs). We use linear regression models to analyze CDSM performance and obtain insights into the linguistic factors that influence how predictable the distributional context of a derived word is going to be. We identify various such factors, notably part of speech, argument structure, and semantic regularity.

Cro36WSD: A Lexical Sample for Croatian Word Sense Disambiguation
Domagoj Alagić | Jan Šnajder
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We introduce Cro36WSD, a freely-available medium-sized lexical sample for Croatian word sense disambiguation (WSD). Cro36WSD comprises 36 words: 12 adjectives, 12 nouns, and 12 verbs, balanced across both frequency bands and polysemy levels. We adopt the multi-label annotation scheme in the hope of lessening the drawbacks of discrete sense inventories and obtaining more realistic annotations from human experts. Sense-annotated data is collected through multiple annotation rounds to ensure high-quality annotations: with a 115 person-hours effort we reached an inter-annotator agreement score of 0.877. We analyze the obtained data and perform a correlation analysis between several relevant variables, including word frequency, number of senses, sense distribution skewness, average annotation time, and the observed inter-annotator agreement (IAA). Using the obtained data, we compile multi- and single-labeled dataset variants using different label aggregation schemes. Finally, we evaluate three different baseline WSD models on both dataset variants and report on the insights gained. We make both dataset variants freely available.

VerbCROcean: A Repository of Fine-Grained Semantic Verb Relations for Croatian
Ivan Sekulić | Jan Šnajder
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we describe VerbCROcean, a broad-coverage repository of fine-grained semantic relations between Croatian verbs. Adopting the methodology of Chklovski and Pantel (2004) used for acquiring the English VerbOcean, we first acquire semantically related verb pairs from the web corpus hrWaC by relying on distributional similarity of subject-verb-object paths in the dependency trees. We then classify the semantic relations between each pair of verbs as similarity, intensity, antonymy, or happens-before, using a number of manually-constructed lexico-syntactic patterns. We evaluate the quality of the resulting resource on a manually annotated sample of 1000 semantic verb relations. The evaluation revealed that the predictions are most accurate for the similarity relation, and least accurate for the intensity relation. We make available two variants of VerbCROcean: a coverage-oriented version, containing about 36k verb pairs at a precision of 41%, and a precision-oriented version containing about 5k verb pairs, at a precision of 56%.

pdf bib
Graph-Based Induction of Word Senses in Croatian
Marko Bekavac | Jan Šnajder
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word sense induction (WSI) seeks to induce senses of words from unannotated corpora. In this paper, we address the WSI task for the Croatian language. We adopt the word clustering approach based on co-occurrence graphs, in which senses are taken to correspond to strongly inter-connected components of co-occurring words. We experiment with a number of graph construction techniques and clustering algorithms, and evaluate the sense inventories both as a clustering problem and extrinsically on a word sense disambiguation (WSD) task. In the cluster-based evaluation, the Chinese Whispers algorithm outperformed Markov Clustering, yielding a normalized mutual information score of 64.3. In contrast, in the WSD evaluation Markov Clustering performed better, yielding an accuracy of about 75%. We are making available two induced sense inventories of the 10,000 most frequent Croatian words: one coarse-grained and one fine-grained inventory, both obtained using the Markov Clustering algorithm.

pdf bib
TakeLab at SemEval-2016 Task 6: Stance Classification in Tweets Using a Genetic Algorithm Based Ensemble
Martin Tutek | Ivan Sekulić | Paula Gombar | Ivan Paljak | Filip Čulinović | Filip Boltužić | Mladen Karan | Domagoj Alagić | Jan Šnajder
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Obtaining a Better Understanding of Distributional Models of German Derivational Morphology
Max Kisselew | Sebastian Padó | Alexis Palmer | Jan Šnajder
Proceedings of the 11th International Conference on Computational Semantics

pdf bib
Identifying Prominent Arguments in Online Debates Using Semantic Textual Similarity
Filip Boltužić | Jan Šnajder
Proceedings of the 2nd Workshop on Argumentation Mining

pdf bib
The 5th Workshop on Balto-Slavic Natural Language Processing
Jakub Piskorski | Lidia Pivovarova | Jan Šnajder | Hristo Tanev | Roman Yangarber
The 5th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Resolving Entity Coreference in Croatian with a Constrained Mention-Pair Model
Goran Glavaš | Jan Šnajder
The 5th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Experiments on Active Learning for Croatian Word Sense Disambiguation
Domagoj Alagić | Jan Šnajder
The 5th Workshop on Balto-Slavic Natural Language Processing

pdf bib
TKLBLIIR: Detecting Twitter Paraphrases with TweetingJay
Mladen Karan | Goran Glavaš | Jan Šnajder | Bojana Dalbelo Bašić | Ivan Vulić | Marie-Francine Moens
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib
HiEve: A Corpus for Extracting Event Hierarchies from News Stories
Goran Glavaš | Jan Šnajder | Marie-Francine Moens | Parisa Kordjamshidi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In news stories, event mentions denote real-world events of different spatial and temporal granularity. Narratives in news stories typically describe some real-world event of coarse spatial and temporal granularity along with its subevents. In this work, we present HiEve, a corpus for recognizing relations of spatiotemporal containment between events. In HiEve, the narratives are represented as hierarchies of events based on relations of spatiotemporal containment (i.e., superevent-subevent relations). We describe the process of manual annotation of HiEve. Furthermore, we build a supervised classifier for recognizing spatiotemporal containment between events to serve as a baseline for future research. Preliminary experimental results are encouraging, with classifier performance reaching 58% F1-score, only 11% less than the inter-annotator agreement.

pdf bib
DerivBase.hr: A High-Coverage Derivational Morphology Resource for Croatian
Jan Šnajder
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Knowledge about derivational morphology has been proven useful for a number of natural language processing (NLP) tasks. We describe the construction and evaluation of DerivBase.hr, a large-coverage morphological resource for Croatian. DerivBase.hr groups 100k lemmas from the web corpus hrWaC into 56k clusters of derivationally related lemmas, so-called derivational families. We focus on suffixal derivation between and within nouns, verbs, and adjectives. We propose two approaches: an unsupervised approach and a knowledge-based approach based on a hand-crafted morphology model but without using any additional lexico-semantic resources. The resource acquisition procedure consists of three steps: corpus preprocessing, acquisition of an inflectional lexicon, and the induction of derivational families. We describe an evaluation methodology based on manually constructed derivational families from which we sample and annotate pairs of lemmas. We evaluate DerivBase.hr on the so-obtained sample, and show that the knowledge-based version attains good clustering quality of 81.2% precision, 76.5% recall, and 78.8% F1-score. As with similar resources for other languages, we expect DerivBase.hr to be useful for a number of NLP tasks.

pdf bib
Back up your Stance: Recognizing Arguments in Online Discussions
Filip Boltužić | Jan Šnajder
Proceedings of the First Workshop on Argumentation Mining

pdf bib
Constructing Coherent Event Hierarchies from News Stories
Goran Glavaš | Jan Šnajder
Proceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing

pdf bib
Towards Semantic Validation of a Derivational Lexicon
Britta Zeller | Sebastian Padó | Jan Šnajder
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Aspect-Oriented Opinion Mining from User Reviews in Croatian
Goran Glavaš | Damir Korenčić | Jan Šnajder
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

pdf bib
Frequently Asked Questions Retrieval for Croatian Based on Semantic Textual Similarity
Mladen Karan | Lovro Žmak | Jan Šnajder
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

pdf bib
GPKEX: Genetically Programmed Keyphrase Extraction from Croatian Texts
Marko Bekavac | Jan Šnajder
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

pdf bib
Event-Centered Information Retrieval Using Kernels on Event Graphs
Goran Glavaš | Jan Šnajder
Proceedings of TextGraphs-8 Graph-based Methods for Natural Language Processing

pdf bib
DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German
Britta Zeller | Jan Šnajder | Sebastian Padó
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Derivational Smoothing for Syntactic Distributional Semantics
Sebastian Padó | Jan Šnajder | Britta Zeller
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Building and Evaluating a Distributional Memory for Croatian
Jan Šnajder | Sebastian Padó | Željko Agić
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Recognizing Identical Events with Graph Kernels
Goran Glavaš | Jan Šnajder
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Experiments on Hybrid Corpus-Based Sentiment Lexicon Acquisition
Goran Glavaš | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

pdf bib
TakeLab: Systems for Measuring Semantic Text Similarity
Frane Šarić | Goran Glavaš | Mladen Karan | Jan Šnajder | Bojana Dalbelo Bašić
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Evaluation of Classification Algorithms and Features for Collocation Extraction in Croatian
Mladen Karan | Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Collocations can be defined as words that occur together significantly more often than would be expected by chance. Many natural language processing applications, such as natural language generation, word sense disambiguation, and machine translation, can benefit from having access to information about collocated words. We approach collocation extraction as a classification problem where the task is to classify a given n-gram as either a collocation (positive) or a non-collocation (negative). Among the features used are word frequencies, classical association measures (Dice, PMI, chi2), and POS tags. In addition, semantic word relatedness modeled by latent semantic analysis is also included. We apply wrapper feature subset selection to determine the best set of features. Performance of various classification algorithms is tested. Experiments are conducted on a manually annotated set of bigrams and trigrams sampled from a Croatian newspaper corpus. The best results obtained are 79.8 F1 measure for bigrams and 67.5 F1 measure for trigrams. The best classifier for bigrams was SVM, while for trigrams the decision tree gave the best performance. The features that contributed the most to overall performance were PMI, semantic relatedness, and POS information.

2010

pdf bib
Corpus Aligner (CorAl) Evaluation on English-Croatian Parallel Corpora
Sanja Seljan | Marko Tadić | Željko Agić | Jan Šnajder | Bojana Dalbelo Bašić | Vjekoslav Osmann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

An increasing demand for new language resources of recent EU members and acceding countries has in turn initiated the development of different language tools and resources, such as alignment tools and corresponding translation memories for new language pairs. The primary goal of this paper is to provide a description of a free sentence alignment tool CorAl (Corpus Aligner), developed at the Faculty of Electrical Engineering and Computing, University of Zagreb. The tool performs paragraph alignment at the first step of the alignment process, which is followed by sentence alignment. Description of the tool is followed by its evaluation. The paper describes an experiment with applying the CorAl aligner to an English-Croatian parallel corpus from the legislative domain using the metrics of precision, recall, and F1-measure. Results are discussed and the concluding sections discuss future directions of CorAl development.

2009

pdf bib
String Distance-Based Stemming of the Highly Inflected Croatian Language
Jan Šnajder | Bojana Dalbelo Bašić
Proceedings of the International Conference RANLP-2009

2008

pdf bib
Evolving New Lexical Association Measures Using Genetic Programming
Jan Šnajder | Bojana Dalbelo Bašić | Saša Petrović | Ivan Sikirić
Proceedings of ACL-08: HLT, Short Papers