2024
GPT-SW3: An Autoregressive Language Model for the Scandinavian Languages
Ariel Ekgren | Amaru Cuba Gyllensten | Felix Stollenwerk | Joey Öhman | Tim Isbister | Evangelia Gogoulou | Fredrik Carlsson | Judit Casademont | Magnus Sahlgren
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper details the process of developing the first native large generative language model for the North Germanic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, through training configuration and instruction finetuning, to evaluation, applications, and considerations for release strategies. We discuss the pros and cons of developing large language models for smaller languages and in relatively peripheral regions of the globe, and we hope that this paper can serve as a guide and reference for other researchers who undertake the development of large generative models for smaller languages.
2022
Lessons Learned from GPT-SW3: Building the First Large-Scale Generative Language Model for Swedish
Ariel Ekgren | Amaru Cuba Gyllensten | Evangelia Gogoulou | Alice Heiman | Severine Verlinden | Joey Öhman | Fredrik Carlsson | Magnus Sahlgren
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present GPT-SW3, a 3.5 billion parameter autoregressive language model, trained on a newly created 100 GB Swedish corpus. This paper provides insights with regard to data collection and training, while highlighting the challenges of proper model evaluation. The results of quantitative evaluation through perplexity indicate that GPT-SW3 is a competent model in comparison with existing autoregressive models of similar size. Additionally, we perform an extensive prompting study which reveals the good text generation capabilities of GPT-SW3.
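Since the quantitative evaluation above is based on perplexity, here is a minimal sketch of how held-out perplexity can be computed for an autoregressive model. It uses a generic Hugging Face causal LM as a stand-in; the checkpoint name and example sentence are placeholders, not the paper's actual setup.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in checkpoint for illustration; the actual GPT-SW3
# weights and tokenizer are not assumed here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under an autoregressive LM: exp of the mean
    per-token negative log-likelihood of next-token prediction."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over the shifted next-token predictions.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("Det här är en mening på svenska."))
```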
2021
GANDALF: a General Character Name Description Dataset for Long Fiction
Fredrik Carlsson | Magnus Sahlgren | Fredrik Olsson | Amaru Cuba Gyllensten
Proceedings of the 3rd Workshop on Machine Reading for Question Answering
This paper introduces a long-range multiple-choice Question Answering (QA) dataset, based on full-length fiction book texts. The questions are formulated as 10-way multiple-choice questions, where the task is to select the correct character name given a character description, or vice-versa. Each character description is formulated in natural text and often contains information from several sections throughout the book. We provide 20,000 questions created from 10,000 manually annotated descriptions of characters from 177 books containing 152,917 words on average. We address the current discourse regarding dataset bias and leakage by a simple anonymization procedure, which in turn enables interesting probing possibilities. Finally, we show that suitable baseline algorithms perform very poorly on this task, with the book size itself making it non-trivial to attempt a Transformer-based QA solution. This leaves ample room for future improvement, and hints at the need for a completely different type of solution.
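To make the task format concrete, below is a hypothetical example of what a 10-way question instance with anonymized name placeholders could look like; all field names and content are invented for illustration and are not taken from the released dataset.

```python
# Hypothetical instance of a 10-way multiple-choice question with anonymized
# character names; structure and values are illustrative, not from GANDALF.
example_question = {
    "book": "Placeholder Novel",
    "description": (
        "An orphaned apprentice who is raised by [NAME_2], loses everything "
        "in the fire at the mill, and later leads the expedition north."
    ),
    "choices": [f"[NAME_{i}]" for i in range(10)],  # 10 candidate names
    "answer": "[NAME_6]",  # the character the description refers to
}
```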
2020
SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
Amaru Cuba Gyllensten | Evangelia Gogoulou | Ariel Ekgren | Magnus Sahlgren
Proceedings of the Fourteenth Workshop on Semantic Evaluation
We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The basic idea is that contextualized embeddings that encode the same sense are located in close proximity in the embedding space. Our approach is both simple and generic, yet performs relatively well in both sub-tasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.
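As a rough illustration of the approach described above, the sketch below embeds each occurrence of a target word with XLM-R and clusters the contextualized vectors with k-means++ initialization; how occurrences from two time periods distribute over the clusters then serves as a change signal. Model choice, pooling, and the number of clusters here are assumptions for the sake of the example, not the authors' exact pipeline.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed_occurrences(sentences: list[str], target: str) -> np.ndarray:
    """One contextualized vector per occurrence: mean of the hidden states
    over the target word's subword span (whole sentence as a fallback)."""
    vectors = []
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        span = next(
            (list(range(i, i + len(target_ids)))
             for i in range(len(ids) - len(target_ids) + 1)
             if ids[i:i + len(target_ids)] == target_ids),
            list(range(len(ids))),
        )
        vectors.append(hidden[span].mean(dim=0).numpy())
    return np.stack(vectors)

# Occurrences of the same word sampled from two time periods.
old = embed_occurrences(["He paced the narrow cell all night.",
                         "The monk returned to his cell to pray."], "cell")
new = embed_occurrences(["She left her cell on the kitchen table.",
                         "My cell has no signal out here."], "cell")

kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(np.vstack([old, new]))
# Comparing how occurrences from the two periods distribute over the
# clusters (senses) gives a simple signal of lexical semantic change.
print(labels[:len(old)], labels[len(old):])
```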
2019
R-grams: Unsupervised Learning of Semantic Units in Natural Language
Amaru Cuba Gyllensten | Ariel Ekgren | Magnus Sahlgren
Proceedings of the 13th International Conference on Computational Semantics - Student Papers
This paper investigates data-driven segmentation using Re-Pair or Byte Pair Encoding techniques. In contrast to previous work, which has primarily focused on subword units for machine translation, we are interested in the general properties of such segments above the word level. We call these segments r-grams, and discuss their properties and the effect they have on the token frequency distribution. The proposed approach is evaluated by demonstrating its viability in embedding techniques, both in monolingual and multilingual test settings. We also provide a number of qualitative examples of the proposed methodology, demonstrating its viability as a language-invariant segmentation procedure.
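As a hedged illustration of segmentation above the word level (not the authors' Re-Pair implementation), the sketch below applies a BPE-style greedy merge to word tokens, repeatedly joining the most frequent adjacent pair into a single multi-word segment.

```python
from collections import Counter

def rgram_segment(tokens: list[str], num_merges: int = 10) -> list[str]:
    """Greedy BPE-style merging over word tokens: repeatedly join the
    most frequent adjacent pair into a single multi-word segment."""
    tokens = list(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # stop when no adjacent pair repeats
            break
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (left, right):
                merged.append(tokens[i] + " " + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

text = ("the united states and the united kingdom signed the agreement "
        "while the united states delayed the agreement").split()
print(rgram_segment(text))
# Frequent word sequences such as "the united states" end up as single
# segments, i.e. units above the word level.
```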
2018
Distributional Term Set Expansion
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Measuring Issue Ownership using Word Embeddings
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Sentiment and topic analysis are common methods used for social media monitoring. Essentially, these methods answer questions such as “what is being talked about, regarding X” and “what do people feel, regarding X”. In this paper, we investigate another avenue for social media monitoring, namely issue ownership and agenda setting, which are concepts from political science that have been used to explain voter choice and electoral outcomes. We argue that issue alignment and agenda setting can be seen as a kind of semantic source similarity of the form “how similar is source A to issue owner P, when talking about issue X”, and as such can be measured using word/document embedding techniques. We present work in progress towards measuring that kind of conditioned similarity, and introduce a new notion of similarity for predictive embeddings. We then test this method by measuring the similarity between politically aligned media and political parties, conditioned on bloc-specific issues.
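One hedged reading of the conditioned similarity “how similar is source A to issue owner P, when talking about issue X” is to restrict both sides to their issue-related vocabulary and compare averaged word vectors with cosine similarity. The toy sketch below does that with made-up vectors; it is an interpretation for illustration, not the predictive-embedding similarity measure introduced in the paper.

```python
import numpy as np

# Toy word vectors; in practice these would come from e.g. word2vec or
# fastText trained on the relevant media and party corpora.
vectors = {
    "immigration": np.array([0.9, 0.1, 0.0]),
    "border":      np.array([0.8, 0.2, 0.1]),
    "tax":         np.array([0.1, 0.9, 0.0]),
    "school":      np.array([0.0, 0.2, 0.9]),
}

def issue_conditioned_embedding(tokens: list[str], issue_terms: set[str]) -> np.ndarray:
    """Average the vectors of the tokens that belong to the issue vocabulary."""
    vecs = [vectors[t] for t in tokens if t in issue_terms and t in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

issue = {"immigration", "border"}
source_a = "the border debate and immigration policy dominated the editorial".split()
party_p = "our party will reform immigration and strengthen the border".split()

# Similarity of source A to party P, conditioned on the chosen issue.
print(cosine(issue_conditioned_embedding(source_a, issue),
             issue_conditioned_embedding(party_p, issue)))
```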
2016
The Gavagai Living Lexicon
Magnus Sahlgren | Amaru Cuba Gyllensten | Fredrik Espinoza | Ola Hamfors | Jussi Karlgren | Fredrik Olsson | Per Persson | Akshay Viswanathan | Anders Holst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper presents the Gavagai Living Lexicon, which is an online distributional semantic model currently available in 20 different languages. We describe the underlying distributional semantic model, and how we have solved some of the challenges in applying such a model to large amounts of streaming data. We also describe the architecture of our implementation, and discuss how we deal with continuous quality assurance of the lexicon.
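For a sense of what applying a distributional model to streaming data can look like in code, here is a minimal sketch of an incrementally updated model based on sparse random index vectors. The dimensionality, window size, and use of random indexing are assumptions made for illustration, not a description of the Gavagai architecture itself.

```python
from collections import defaultdict

import numpy as np

DIM, NONZERO, WINDOW = 512, 8, 2
rng = np.random.default_rng(0)

def index_vector() -> np.ndarray:
    """Sparse ternary random vector used as a fixed fingerprint of a word."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

index = defaultdict(index_vector)              # fixed per-word index vectors
context = defaultdict(lambda: np.zeros(DIM))   # continuously updated context vectors

def update(tokens: list[str]) -> None:
    """Fold each word's neighbours into its context vector, one pass, no batching."""
    for i, word in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                context[word] += index[tokens[j]]

# The model can be updated document by document as data streams in.
update("the living lexicon is updated continuously from streaming text".split())
update("the lexicon tracks new words and senses as they appear in text".split())
```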
2015
Navigating the Semantic Horizon using Relative Neighborhood Graphs
Amaru Cuba Gyllensten | Magnus Sahlgren
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing