Sanja Štajner

Also published as: Sanja Stajner


2021

Why Is MBTI Personality Detection from Texts a Difficult Task?
Sanja Stajner | Seren Yenikent
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Automatic detection of the four MBTI personality dimensions from texts has recently attracted noticeable attention from the natural language processing and computational linguistics communities. Despite the large collections of Twitter data for training, the best systems rarely even outperform the majority-class baseline. In this paper, we discuss the theoretical reasons for such low results and present insights from an annotation study that shed further light on this issue.
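
For readers unfamiliar with the majority-class baseline mentioned in this abstract, the following is a minimal sketch of how such a baseline is usually computed. The function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_class_baseline(train_labels, test_labels):
    """Return the most frequent training label and the accuracy obtained
    by predicting that label for every test instance."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Hypothetical labels for one MBTI dimension (I = introvert, E = extravert).
train = ["I", "I", "I", "E", "I", "E", "I", "I"]
test = ["I", "E", "I", "I", "E"]
label, accuracy = majority_class_baseline(train, test)
print(label, accuracy)
```

A classifier is only informative on a dimension if it beats this number, which is why skewed label distributions make the task look easier than it is.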

Exploring Reliability of Gold Labels for Emotion Detection in Twitter
Sanja Stajner
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Emotion detection from social media posts has attracted noticeable attention from the natural language processing (NLP) community in recent years. The ways of obtaining gold labels for training and testing systems for automatic emotion detection differ significantly from one study to another, which raises the question of how reliable the gold labels and the obtained classification results are. This study systematically explores several ways of obtaining gold labels for Ekman’s emotion model on Twitter data and the influence of the chosen strategy on the manual classification results.

How to Obtain Reliable Labels for MBTI Classification from Texts?
Sanja Stajner | Seren Yenikent
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Automatic detection of the Myers-Briggs Type Indicator (MBTI) from short posts has attracted noticeable attention in the last few years. Recent studies showed that this is quite a difficult task, especially on commonly used Twitter data. Obtaining MBTI labels is also difficult, as human annotation requires trained psychologists, and the automatic way of obtaining them is through long questionnaires of questionable usability for the task. In this paper, we present a method for collecting reliable MBTI labels via only four carefully selected questions that can be applied to any type of textual data.

Automatic Text Simplification for Social Good: Progress and Challenges
Sanja Stajner
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

What Motivates You? Benchmarking Automatic Detection of Basic Needs from Short Posts
Sanja Stajner | Seren Yenikent | Bilal Ghanem | Marc Franco-Salvador
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

According to the self-determination theory, the levels of satisfaction of three basic needs (competence, autonomy and relatedness) have implications for people’s everyday life and career. We benchmark the novel task of automatically detecting those needs in short posts in English, by modelling it as a ternary classification task, and as three binary classification tasks. A detailed manual analysis shows that the latter has advantages in the real-world scenario, and that our best models achieve performance similar to that of a trained human annotator.

2020

A Survey of Automatic Personality Detection from Texts
Sanja Stajner | Seren Yenikent
Proceedings of the 28th International Conference on Computational Linguistics

Personality profiling has long been used in psychology to predict life outcomes. Recently, automatic detection of personality traits from written messages has gained significant attention in the computational linguistics and natural language processing communities, due to its applicability in various fields. In this survey, we show the trajectory of research towards automatic personality detection from purely psychology approaches, through psycholinguistics, to the recent purely natural language processing approaches on large datasets automatically extracted from social media. We point out what has been gained and what has been lost along that trajectory, and show what expectations are realistic in the field.

When Shallow is Good Enough: Automatic Assessment of Conceptual Text Complexity using Shallow Semantic Features
Sanja Stajner | Ioana Hulpuș
Proceedings of the 12th Language Resources and Evaluation Conference

According to psycholinguistic studies, the complexity of concepts used in a text and the relations between mentioned concepts play the most important role in text understanding and maintaining the reader’s interest. However, the classical approaches to automatic assessment of text complexity, and their commercial applications, take into consideration mainly syntactic and lexical complexity. Recently, we introduced the task of automatic assessment of conceptual text complexity, proposing a set of graph-based deep semantic features using DBpedia as a proxy for human knowledge. Given that such graphs can be noisy, incomplete, and computationally expensive to deal with, in this paper, we propose the use of textual features and shallow semantic features that only require entity linking. We compare the results obtained with the new features with those of the state-of-the-art deep semantic features on two tasks: (1) pairwise comparison of two versions of the same text; and (2) five-level classification of texts. We find that the shallow features achieve state-of-the-art results on both tasks, significantly outperforming the deep semantic features on the five-level classification task. Interestingly, the combination of the shallow and deep semantic features leads to a significant improvement in performance on that task.

CoCo: A Tool for Automatically Assessing Conceptual Complexity of Texts
Sanja Stajner | Sergiu Nisioi | Ioana Hulpuș
Proceedings of the 12th Language Resources and Evaluation Conference

Traditional text complexity assessment usually takes into account only syntactic and lexical text complexity. The task of automatic assessment of conceptual text complexity, important for maintaining the reader’s interest and for text adaptation for struggling readers, has only been proposed recently. In this paper, we present CoCo, a tool for automatic assessment of conceptual text complexity, based on the current state-of-the-art unsupervised approach. We make the code and API freely available for research purposes, and describe in detail the code and the possibilities for its personalization and adaptation. We compare the current implementation with the state of the art, discussing the influence of the choice of entity linker on the performance of the tool. Finally, we present results obtained on two widely used text simplification corpora, discussing the full potential of the tool.

2019

A Spreading Activation Framework for Tracking Conceptual Complexity of Texts
Ioana Hulpuș | Sanja Štajner | Heiner Stuckenschmidt
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose an unsupervised approach for assessing conceptual complexity of texts, based on spreading activation. With the DBpedia knowledge graph serving as a proxy for long-term memory, mentioned concepts become activated and trigger further activation as the text is sequentially traversed. Drawing inspiration from psycholinguistic theories of reading comprehension, we model memory processes such as semantic priming, sentence wrap-up, and forgetting. We show that our models capture various aspects of conceptual text complexity and significantly outperform the current state of the art.
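
As an illustration of the general idea of spreading activation over a graph, here is a minimal generic sketch. It is not the model from the paper: the toy graph, the `decay` and `threshold` parameters, and the equal split of activation among neighbours are all assumptions made for this example, and no DBpedia data is involved.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1, max_steps=3):
    """Propagate activation from seed nodes through a directed graph.

    graph: dict mapping a node to the list of its neighbours
    seeds: dict mapping a node to its initial activation
    Each step, a node passes on decay * activation, split equally among
    its neighbours; shares below the threshold are dropped.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = {}
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = value * decay / len(neighbours)
            if share < threshold:
                continue  # too weak to activate anything further
            for neighbour in neighbours:
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + share
        for node, value in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + value
        frontier = next_frontier
        if not frontier:
            break
    return activation


# Hypothetical toy graph standing in for a knowledge graph.
toy_graph = {"Paris": ["France", "Seine"], "France": ["Europe"]}
activation = spread_activation(toy_graph, {"Paris": 1.0})
print(activation)
```

In this sketch, a concept mentioned in the text ("Paris") primes related concepts, so a later mention of "France" would already carry some activation.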

SymantoResearch at SemEval-2019 Task 3: Combined Neural Models for Emotion Classification in Human-Chatbot Conversations
Angelo Basile | Marc Franco-Salvador | Neha Pawar | Sanja Štajner | Mara Chinea Rios | Yassine Benajiba
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we present our participation in the EmoContext shared task on detecting emotions in English textual conversations between a human and a chatbot. We propose four neural systems and combine them to further improve the results. We show that our neural ensemble systems can successfully distinguish three emotions (SAD, HAPPY, and ANGRY) and separate them from the rest (OTHERS) in a highly imbalanced scenario. Our best system achieved a 0.77 F1-score and was ranked fourth out of 165 submissions.

Automated Text Simplification as a Preprocessing Step for Machine Translation into an Under-resourced Language
Sanja Štajner | Maja Popović
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this work, we investigate the possibility of using a fully automatic text simplification system on the English source in machine translation (MT) to improve its translation into an under-resourced language. We use a state-of-the-art automatic text simplification (ATS) system for lexically and syntactically simplifying source sentences, which are then translated with two state-of-the-art English-to-Serbian MT systems, the phrase-based MT (PBMT) and the neural MT (NMT). We explore three different scenarios for using the ATS in MT: (1) using the raw output of the ATS; (2) automatically filtering out the sentences with low grammaticality and meaning preservation scores; and (3) performing a minimal manual correction of the ATS output. Our results show an improvement in the fluency of the translation regardless of the chosen scenario, and differences in the success of the three scenarios, depending on the MT approach used (PBMT or NMT), with regard to improving translation fluency and post-editing effort.

2018

Automatic Assessment of Conceptual Text Complexity Using Knowledge Graphs
Sanja Štajner | Ioana Hulpuș
Proceedings of the 27th International Conference on Computational Linguistics

Complexity of texts is usually assessed only at the lexical and syntactic levels. Although it is known that conceptual complexity plays a significant role in text understanding, no attempts have been made at assessing it automatically. We propose to automatically estimate the conceptual complexity of texts by exploiting a number of graph-based measures on a large knowledge base. By using a high-quality language learners corpus for English, we show that graph-based measures of individual text concepts, as well as the way they relate to each other in the knowledge graph, have a high discriminative power when distinguishing between two versions of the same text. Furthermore, when used as features in a binary classification task aiming to choose the simpler of two versions of the same text, our measures achieve high performance even in a default setup.

Data-Driven Text Simplification
Sanja Štajner | Horacio Saggion
Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts

A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

Word Embeddings-Based Uncertainty Detection in Financial Disclosures
Christoph Kilian Theil | Sanja Štajner | Heiner Stuckenschmidt
Proceedings of the First Workshop on Economics and Natural Language Processing

In this paper, we use NLP techniques to detect linguistic uncertainty in financial disclosures. Leveraging general-domain and domain-specific word embedding models, we automatically expand an existing dictionary of uncertainty triggers. We furthermore examine how expert filtering affects the quality of such an expansion. We show that the dictionary expansions significantly improve regressions on stock return volatility. Lastly, we prove that the expansions significantly boost the automatic detection of uncertain sentences.

Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)
Arne Jönsson | Evelina Rennes | Horacio Saggion | Sanja Stajner | Victoria Yaneva
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

Improving Machine Translation of English Relative Clauses with Automatic Text Simplification
Sanja Štajner | Maja Popović
Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA)

A Detailed Evaluation of Neural Sequence-to-Sequence Models for In-domain and Cross-domain Text Simplification
Sanja Štajner | Sergiu Nisioi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

CATS: A Tool for Customized Alignment of Text Simplification Corpora
Sanja Štajner | Marc Franco-Salvador | Paolo Rosso | Simone Paolo Ponzetto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Complex word identification (CWI) is an important task in text accessibility. However, due to the scarcity of CWI datasets, previous studies have only addressed this problem on Wikipedia sentences and have solely taken into account the needs of non-native English speakers. We collect a new CWI dataset (CWIG3G2) covering three text genres (News, WikiNews, and Wikipedia) annotated by both native and non-native English speakers. Unlike previous datasets, we cover single words, as well as complex phrases, and present them for judgment in a paragraph context. We present the first study on cross-genre and cross-group CWI, showing measurable influences of native language and genre types.

Exploring Neural Text Simplification Models
Sergiu Nisioi | Sanja Štajner | Simone Paolo Ponzetto | Liviu P. Dinu
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike the previously proposed automated TS systems, our neural text simplification (NTS) systems are able to simultaneously perform lexical simplification and content reduction. An extensive human evaluation of the output has shown that NTS systems achieve almost perfect grammaticality and meaning preservation of output sentences and a higher level of simplification than the state-of-the-art automated TS systems.

Sentence Alignment Methods for Improving Text Simplification Systems
Sanja Štajner | Marc Franco-Salvador | Simone Paolo Ponzetto | Paolo Rosso | Heiner Stuckenschmidt
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We provide several methods for sentence-alignment of texts with different complexity levels. Using the best of them, we sentence-align the Newsela corpora, thus providing large training materials for automatic text simplification (ATS) systems. We show that using this dataset, even the standard phrase-based statistical machine translation models for ATS can outperform the state-of-the-art ATS systems.

Multilingual and Cross-Lingual Complex Word Identification
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous works largely depend on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of doubtful quality and limited to English. We collect complex words/phrases (CP) for English, German and Spanish, annotated by both native and non-native speakers, and propose language-independent features that can be used to train multilingual and cross-lingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it to the other languages) is comparable to the performance of monolingual CWI systems.

Effects of Lexical Properties on Viewing Time per Word in Autistic and Neurotypical Readers
Sanja Štajner | Victoria Yaneva | Ruslan Mitkov | Simone Paolo Ponzetto
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Eye tracking studies from the past few decades have shaped the way we think of word complexity and cognitive load: words that are long, rare and ambiguous are more difficult to read. However, online processing techniques have scarcely been applied to investigating the reading difficulties of people with autism and what vocabulary is challenging for them. We present parallel gaze data obtained from adult readers with autism and a control group of neurotypical readers and show that the former required higher cognitive effort to comprehend the texts, as evidenced by three gaze-based measures. We divide all words into four classes based on their viewing times for both groups and investigate the relationship between longer viewing times and word length, word frequency, and four cognitively-based measures (word concreteness, familiarity, age of acquisition and imageability).

2016

Can Text Simplification Help Machine Translation?
Sanja Štajner | Maja Popović
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

Use of Domain-Specific Language Resources in Machine Translation
Sanja Štajner | Andreia Querido | Nuno Rendeiro | João António Rodrigues | António Branco
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we address the problem of Machine Translation (MT) for a specialised domain in a language pair for which only a very small domain-specific parallel corpus is available. We conduct a series of experiments using a purely phrase-based SMT (PBSMT) system and a hybrid MT system (TectoMT), testing three different strategies to overcome the problem of the small amount of in-domain training data. Our results show that adding a small in-domain bilingual terminology to the small in-domain training corpus leads to the best improvements of the hybrid MT system, while the PBSMT system achieves the best results by adding a combination of in-domain bilingual terminology and a larger out-of-domain corpus. We focus on qualitative human evaluation of the output of the two best systems (one for each approach) and perform a systematic in-depth error analysis, which revealed advantages of the hybrid MT system over the pure PBSMT system for this specific task.

Bootstrapping a Hybrid MT System to a New Language Pair
João António Rodrigues | Nuno Rendeiro | Andreia Querido | Sanja Štajner | António Branco
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The usual concern when opting for a rule-based or a hybrid machine translation (MT) system is how much effort is required to adapt the system to a different language pair or a new domain. In this paper, we describe a way of adapting an existing hybrid MT system to a new language pair, and show that such a system can outperform a standard phrase-based statistical machine translation system with an average of 10 persons/month of work. This is especially important in the case of domain-specific MT, for which there is not enough parallel data for training a statistical machine translation system.

2015

Machine Translation for Multilingual Troubleshooting in the IT Domain: A Comparison of Different Strategies
Sanja Štajner | João Rodrigues | Luís Gomes | António Branco
Proceedings of the 1st Deep Machine Translation Workshop

Translating from Original to Simplified Sentences using Moses: When does it Actually Work?
Sanja Štajner | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

Automatic Text Simplification for Spanish: Comparative Evaluation of Various Simplification Strategies
Sanja Štajner | Iacer Calixto | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing

Simplifying Lexical Simplification: Do We Need Simplified Corpora?
Goran Glavaš | Sanja Štajner
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

A Deeper Exploration of the Standard PB-SMT Approach to Text Simplification and its Evaluation
Sanja Štajner | Hannah Béchara | Horacio Saggion
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

One Step Closer to Automatic Evaluation of Text Simplification Systems
Sanja Štajner | Ruslan Mitkov | Horacio Saggion
Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR)

The Fewer, the Better? A Contrastive Study about Ways to Simplify
Ruslan Mitkov | Sanja Štajner
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

Assessing Conformance of Manually Simplified Corpora with User Requirements: the Case of Autistic Readers
Sanja Štajner | Richard Evans | Iustin Dornescu
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

2013

Event-Centered Simplification of News Stories
Goran Glavaš | Sanja Štajner
Proceedings of the Student Research Workshop associated with RANLP 2013

Readability Indices for Automatic Evaluation of Text Simplification Systems: A Feasibility Study for Spanish
Sanja Štajner | Horacio Saggion
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach
Sanja Štajner | Ruslan Mitkov
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

A syntactically complex text may pose a problem both for comprehension by humans and for various NLP tasks. A large number of studies in text simplification are concerned with this problem, and their aim is to transform the given text into a simplified form in order to make it accessible to a wider audience. In this study, we investigated the natural tendency of texts in 20th century English. Are they becoming syntactically more complex over the years, requiring a higher literacy level and greater effort from the readers, or are they becoming simpler and easier to read? We examined several factors of text complexity (average sentence length, Automated Readability Index, sentence complexity and passive voice) in the 20th century for two main English language varieties, British and American, using the ‘Brown family’ of corpora. In British English, we compared the complexity of texts published in 1931, 1961 and 1991, while in American English we compared the complexity of texts published in 1961 and 1992. Furthermore, we demonstrated how state-of-the-art NLP tools can be used for automatic extraction of some complex features from the raw text version of the corpora.
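
Two of the factors named in this abstract, average sentence length and the Automated Readability Index (ARI), can be approximated with very little code. The sketch below uses the standard ARI formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43. The regular-expression tokenization is a deliberately naive assumption made for this example (it splits contractions and treats every ".", "!" or "?" run as a sentence boundary) and is not the tooling used in the paper.

```python
import re


def automated_readability_index(text):
    """Compute ARI with naive regex-based sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    characters = sum(len(word) for word in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


simple = "The cat sat. The dog ran."
harder = "Syntactically complicated sentences demand considerable effort from readers."
ari_simple = automated_readability_index(simple)
ari_harder = automated_readability_index(harder)
print(ari_simple, ari_harder)
```

Longer words and longer sentences both raise the score, so the second example scores higher than the first.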

2011

Towards a Better Exploitation of the Brown ‘Family’ Corpora in Diachronic Studies of British and American English Language Varieties
Sanja Štajner
Proceedings of the Second Student Research Workshop associated with RANLP 2011

Diachronic Stylistic Changes in British and American Varieties of 20th Century Written English Language
Sanja Štajner | Ruslan Mitkov
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage