Magnus Sahlgren

2021

pdf bib abs
It’s Basically the Same Language Anyway: the Case for a Nordic Language Model
Magnus Sahlgren | Fredrik Carlsson | Fredrik Olsson | Love Börjeson
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

When is it beneficial for a research community to organize a broader collaborative effort on a topic, and when should we instead promote individual efforts? In this opinion piece, we argue that we are at a stage in the development of large-scale language models where a collaborative effort is desirable, despite the fact that the preconditions for making individual contributions have never been better. We consider a number of arguments for collaboratively developing a large-scale Nordic language model, include environmental considerations, cost, data availability, language typology, cultural similarity, and transparency. Our primary goal is to raise awareness and foster a discussion about our potential impact and responsibility as NLP community.

pdf bib abs
Decentralized Word2Vec Using Gossip Learning
Abdul Aziz Alkathiri | Lodovico Giaretta | Sarunas Girdzijauskas | Magnus Sahlgren
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Advanced NLP models require huge amounts of data from various domains to produce high-quality representations. It is useful then for a few large public and private organizations to join their corpora during training. However, factors such as legislation and user emphasis on data privacy may prevent centralized orchestration and data sharing among these organizations. Therefore, for this specific scenario, we investigate how gossip learning, a massively-parallel, data-private, decentralized protocol, compares to a shared-dataset solution. We find that the application of Word2Vec in a gossip learning framework is viable. Without any tuning, the results are comparable to a traditional centralized setting, with a loss of quality as low as 4.3%. Furthermore, the results are up to 54.8% better than independent local training.

pdf bib abs
Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?
Tim Isbister | Fredrik Carlsson | Magnus Sahlgren
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling the use of pretrained, and large-scale, English language models. We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages. The exception to this is Finnish, which we assume is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument for training native language models for low-resource languages. This paper therefore strives to make a provocative but important point. As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is from an empirical and environmental stand-point more effective to translate data from low-resource languages into English, than to build language models for such languages.

pdf bib abs
GANDALF: a General Character Name Description Dataset for Long Fiction
Fredrik Carlsson | Magnus Sahlgren | Fredrik Olsson | Amaru Cuba Gyllensten
Proceedings of the 3rd Workshop on Machine Reading for Question Answering

This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice-versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.

pdf bib abs
Predicting Treatment Outcome from Patient Texts:The Case of Internet-Based Cognitive Behavioural Therapy
Evangelia Gogoulou | Magnus Boman | Fehmi Ben Abdesslem | Nils Hentati Isacsson | Viktor Kaldo | Magnus Sahlgren
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We investigate the feasibility of applying standard text categorisation methods to patient text in order to predict treatment outcome in Internet-based cognitive behavioural therapy. The data set is unique in its detail and size for regular care for depression, social anxiety, and panic disorder. Our results indicate that there is a signal in the depression data, albeit a weak one. We also perform terminological and sentiment analysis, which confirm those results.

2020

pdf bib abs
Text Categorization for Conflict Event Annotation
Fredrik Olsson | Magnus Sahlgren | Fehmi ben Abdesslem | Ariel Ekgren | Kristine Eck
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020

We cast the problem of event annotation as one of text categorization, and compare state of the art text categorization techniques on event data produced within the Uppsala Conflict Data Program (UCDP). Annotating a single text involves assigning the labels pertaining to at least 17 distinct categorization tasks, e.g., who were the attacking organization, who was attacked, and where did the event take place. The text categorization techniques under scrutiny are a classical Bag-of-Words approach; character-based contextualized embeddings produced by ELMo; embeddings produced by the BERT base model, and a version of BERT base fine-tuned on UCDP data; and a pre-trained and fine-tuned classifier based on ULMFiT. The categorization tasks are very diverse in terms of the number of classes to predict as well as the skeweness of the distribution of classes. The categorization results exhibit a large variability across tasks, ranging from 30.3% to 99.8% F-score.

pdf bib abs
SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
Amaru Cuba Gyllensten | Evangelia Gogoulou | Ariel Ekgren | Magnus Sahlgren
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The basic idea is that contextualized embeddings that encode the same sense are located in close proximity in the embedding space. Our approach is both simple and generic, but yet performs relatively good in both sub-tasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.

pdf bib abs
Rethinking Topic Modelling: From Document-Space to Term-Space
Magnus Sahlgren
Findings of the Association for Computational Linguistics: EMNLP 2020

This paper problematizes the reliance on documents as the basic notion for defining term interactions in standard topic models. As an alternative to this practice, we reformulate topic distributions as latent factors in term similarity space. We exemplify the idea using a number of standard word embeddings built with very wide context windows. The embedding spaces are transformed to sparse similarity spaces, and topics are extracted in standard fashion by factorizing to a lower-dimensional space. We use a number of different factorization techniques, and evaluate the various models using a large set of evaluation metrics, including previously published coherence measures, as well as a number of novel measures that we suggest better correspond to real-world applications of topic models. Our results clearly demonstrate that term-based models outperform standard document-based models by a large margin.

2019

pdf bib abs
R-grams: Unsupervised Learning of Semantic Units in Natural Language
Amaru Cuba Gyllensten | Ariel Ekgren | Magnus Sahlgren
Proceedings of the 13th International Conference on Computational Semantics - Student Papers

This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding-techniques. In contrast to previous work which has primarily been focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.

pdf bib abs
Gender Bias in Pretrained Swedish Embeddings
Magnus Sahlgren | Fredrik Olsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper investigates the presence of gender bias in pretrained Swedish embeddings. We focus on a scenario where names are matched with occupations, and we demonstrate how a number of standard pretrained embeddings handle this task. Our experiments show some significant differences between the pretrained embeddings, with word-based methods showing the most bias and contextualized language models showing the least. We also demonstrate that the previously proposed debiasing method does not affect the performance of the various embeddings in this scenario.

2018

pdf bib abs
Learning Representations for Detecting Abusive Language
Magnus Sahlgren | Tim Isbister | Fredrik Olsson
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)

This paper discusses the question whether it is possible to learn a generic representation that is useful for detecting various types of abusive language. The approach is inspired by recent advances in transfer learning and word embeddings, and we learn representations from two different datasets containing various degrees of abusive language. We compare the learned representation with two standard approaches; one based on lexica, and one based on data-specific n-grams. Our experiments show that learned representations do contain useful information that can be used to improve detection performance when training data is limited.

pdf bib abs
Measuring Issue Ownership using Word Embeddings
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answers questions such as, “what is being talked about, regarding X”, and “what do people feel, regarding X”. In this paper, we investigate another venue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to explain voter choice and electoral outcomes. We argue that issue alignment and agenda setting can be seen as a kind of semantic source similarity of the kind “how similar is source A to issue owner P, when talking about issue X”, and as such can be measured using word/document embedding techniques. We present work in progress towards measuring that kind of conditioned similarity, and introduce a new notion of similarity for predictive embeddings. We then test this method by measuring the similarity between politically aligned media and political parties, conditioned on bloc-specific issues.

pdf bib
Distributional Term Set Expansion
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib
Parameterized context windows in Random Indexing
Tobias Norlund | David Nilsson | Magnus Sahlgren
Proceedings of the 1st Workshop on Representation Learning for NLP

pdf bib
Unshared task: (Dis)agreement in online debates
Maria Skeppstedt | Magnus Sahlgren | Carita Paradis | Andreas Kerren
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)

pdf bib abs
Active learning for detection of stance components
Maria Skeppstedt | Magnus Sahlgren | Carita Paradis | Andreas Kerren
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

Automatic detection of five language components, which are all relevant for expressing opinions and for stance taking, was studied: positive sentiment, negative sentiment, speculation, contrast and condition. A resource-aware approach was taken, which included manual annotation of 500 training samples and the use of limited lexical resources. Active learning was compared to random selection of training data, as well as to a lexicon-based method. Active learning was successful for the categories speculation, contrast and condition, but not for the two sentiment categories, for which results achieved when using active learning were similar to those achieved when applying a random selection of training data. This difference is likely due to a larger variation in how sentiment is expressed than in how speakers express the other three categories. This larger variation was also shown by the lower recall results achieved by the lexicon-based approach for sentiment than for the categories speculation, contrast and condition.

pdf bib
The Effects of Data Size and Frequency Range on Distributional Semantic Models
Magnus Sahlgren | Alessandro Lenci
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.

This paper discusses evaluation methodologies for a particular kind of meaning models known as word-space models, which use distributional information to assemble geometric representations of meaning similarities. Word-space models have received considerable attention in recent years, and have begun to see employment outside the walls of computational linguistics laboratories. However, the evaluation methodologies of such models remain infantile, and lack efforts at standardization. Very few studies have critically assessed the methodologies used to evaluate word spaces. This paper attempts to fill some of this void. It is the central goal of this paper to answer the question how can we determine whether a given word space is a good word space?

pdf bib
Creating bilingual lexica using reference wordlists for alignment of monolingual semantic vector spaces
Jon Holmlund | Magnus Sahlgren | Jussi Karlgren
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)