Evangelia Gogoulou

2024

This paper details the process of developing the first native large generative language model for the North Germanic languages, GPT-SW3. We cover all parts of the development process, from data collection and processing, training configuration and instruction finetuning, to evaluation, applications, and considerations for release strategies. We discuss pros and cons of developing large language models for smaller languages and in relatively peripheral regions of the globe, and we hope that this paper can serve as a guide and reference for other researchers that undertake the development of large generative models for smaller languages.

2023

On the Concept of Resource-Efficiency in NLP
Luise Dürlich | Evangelia Gogoulou | Joakim Nivre
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Resource-efficiency is a growing concern in the NLP community. But what are the resources we care about and why? How do we measure efficiency in a way that is reliable and relevant? And how do we balance efficiency and other important concerns? Based on a review of the emerging literature on the subject, we discuss different ways of conceptualizing efficiency in terms of product and cost, using a simple case study on fine-tuning and knowledge distillation for illustration. We propose a novel metric of amortized efficiency that is better suited for life-cycle analysis than existing metrics.

2022

Cross-lingual Transfer of Monolingual Models
Evangelia Gogoulou | Ariel Ekgren | Tim Isbister | Magnus Sahlgren
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Recent studies in cross-lingual learning using multilingual models have cast doubt on the previous hypothesis that shared vocabulary and joint pre-training are the keys to cross-lingual generalization. We introduce a method for transferring monolingual models to other languages through continuous pre-training and study the effects of such transfer from four different languages to English. Our experimental results on GLUE show that the transferred models outperform an English model trained from scratch, independently of the source language. After probing the model representations, we find that model knowledge from the source language enhances the learning of syntactic and semantic knowledge in English.

We present GPT-SW3, a 3.5 billion parameter autoregressive language model, trained on a newly created 100 GB Swedish corpus. This paper provides insights with regard to data collection and training, while highlighting the challenges of proper model evaluation. The results of quantitative evaluation through perplexity indicate that GPT-SW3 is a competent model in comparison with existing autoregressive models of similar size. Additionally, we perform an extensive prompting study which reveals the good text generation capabilities of GPT-SW3.

2021

Predicting Treatment Outcome from Patient Texts: The Case of Internet-Based Cognitive Behavioural Therapy
Evangelia Gogoulou | Magnus Boman | Fehmi Ben Abdesslem | Nils Hentati Isacsson | Viktor Kaldo | Magnus Sahlgren
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

We investigate the feasibility of applying standard text categorisation methods to patient text in order to predict treatment outcome in Internet-based cognitive behavioural therapy. The data set is unique in its detail and size for regular care for depression, social anxiety, and panic disorder. Our results indicate that there is a signal in the depression data, albeit a weak one. We also perform terminological and sentiment analyses, which confirm those results.

2020

SenseCluster at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
Amaru Cuba Gyllensten | Evangelia Gogoulou | Ariel Ekgren | Magnus Sahlgren
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We (Team Skurt) propose a simple method to detect lexical semantic change by clustering contextualized embeddings produced by XLM-R, using K-Means++. The basic idea is that contextualized embeddings that encode the same sense are located in close proximity in the embedding space. Our approach is both simple and generic, yet performs relatively well in both sub-tasks of SemEval-2020 Task 1. We hypothesize that the main shortcoming of our method lies in the simplicity of the clustering method used.