2021
GANDALF: a General Character Name Description Dataset for Long Fiction
Fredrik Carlsson | Magnus Sahlgren | Fredrik Olsson | Amaru Cuba Gyllensten
Proceedings of the 3rd Workshop on Machine Reading for Question Answering
This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.
It’s Basically the Same Language Anyway: the Case for a Nordic Language Model
Magnus Sahlgren | Fredrik Carlsson | Fredrik Olsson | Love Börjeson
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
When is it beneficial for a research community to organize a broader collaborative effort on a topic, and when should we instead promote individual efforts? In this opinion piece, we argue that we are at a stage in the development of large-scale language models where a collaborative effort is desirable, despite the fact that the preconditions for making individual contributions have never been better. We consider a number of arguments for collaboratively developing a large-scale Nordic language model, including environmental considerations, cost, data availability, language typology, cultural similarity, and transparency. Our primary goal is to raise awareness and foster a discussion about our potential impact and responsibility as an NLP community.
2020
Text Categorization for Conflict Event Annotation
Fredrik Olsson | Magnus Sahlgren | Fehmi ben Abdesslem | Ariel Ekgren | Kristine Eck
Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020
We cast the problem of event annotation as one of text categorization, and compare state-of-the-art text categorization techniques on event data produced within the Uppsala Conflict Data Program (UCDP). Annotating a single text involves assigning the labels pertaining to at least 17 distinct categorization tasks, e.g., who the attacking organization was, who was attacked, and where the event took place. The text categorization techniques under scrutiny are a classical Bag-of-Words approach; character-based contextualized embeddings produced by ELMo; embeddings produced by the BERT base model, and a version of BERT base fine-tuned on UCDP data; and a pre-trained and fine-tuned classifier based on ULMFiT. The categorization tasks are very diverse in terms of the number of classes to predict as well as the skewness of the distribution of classes. The categorization results exhibit a large variability across tasks, ranging from 30.3% to 99.8% F-score.
2019
Gender Bias in Pretrained Swedish Embeddings
Magnus Sahlgren | Fredrik Olsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics
This paper investigates the presence of gender bias in pretrained Swedish embeddings. We focus on a scenario where names are matched with occupations, and we demonstrate how a number of standard pretrained embeddings handle this task. Our experiments show some significant differences between the pretrained embeddings, with word-based methods showing the most bias and contextualized language models showing the least. We also demonstrate that the previously proposed debiasing method does not affect the performance of the various embeddings in this scenario.
2018
Learning Representations for Detecting Abusive Language
Magnus Sahlgren | Tim Isbister | Fredrik Olsson
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
This paper discusses the question of whether it is possible to learn a generic representation that is useful for detecting various types of abusive language. The approach is inspired by recent advances in transfer learning and word embeddings, and we learn representations from two different datasets containing various degrees of abusive language. We compare the learned representation with two standard approaches: one based on lexica, and one based on data-specific n-grams. Our experiments show that learned representations do contain useful information that can be used to improve detection performance when training data is limited.
2016
The Gavagai Living Lexicon
Magnus Sahlgren | Amaru Cuba Gyllensten | Fredrik Espinoza | Ola Hamfors | Jussi Karlgren | Fredrik Olsson | Per Persson | Akshay Viswanathan | Anders Holst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.
2009
Methods for Amharic Part-of-Speech Tagging
Björn Gambäck | Fredrik Olsson | Atelach Alemu Argaw | Lars Asker
Proceedings of the First Workshop on Language Technologies for African Languages
An Intrinsic Stopping Criterion for Committee-Based Active Learning
Fredrik Olsson | Katrin Tomanek
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)
A Web Survey on the Use of Active Learning to Support Annotation of Text Data
Katrin Tomanek | Fredrik Olsson
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
2002
Notions of Correctness when Evaluating Protein Name Taggers
Fredrik Olsson | Gunnar Eriksson | Kristofer Franzén | Lars Asker | Per Lidén
COLING 2002: The 19th International Conference on Computational Linguistics
2000
Experiences of Language Engineering Algorithm Reuse
Björn Gambäck | Fredrik Olsson
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
Composing a General-Purpose Toolbox for Swedish
Fredrik Olsson | Björn Gambäck
Proceedings of the COLING-2000 Workshop on Using Toolsets and Architectures To Build NLP Systems