Måns Magnusson


2025

Detecting Legal Citations in United Kingdom Court Judgments
Holli Sargeant | Andreas Östling | Måns Magnusson
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Legal citation detection in court judgments underpins reliable precedent mapping, citation analytics, and document retrieval. Extracting references to legislation and case law in the United Kingdom is especially challenging: citation styles have evolved over centuries, and judgments routinely cite foreign or historical authorities. We conduct the first systematic comparison of three modelling paradigms on this task using the Cambridge Law Corpus: (i) rule-based regular expressions; (ii) transformer-based encoders (BERT, RoBERTa, LEGAL-BERT, ModernBERT); and (iii) large language models (GPT-4.1). We produce a high-quality gold-standard corpus of 190 court judgments containing 45,179 fine-grained annotations for UK and non-UK legislation and case references. ModernBERT achieves a macro-averaged F1 of 93.3%, only marginally ahead of the other encoder-only models, yet significantly outperforming the strongest regular-expression baseline (35.42% F1) and GPT-4.1 (76.57% F1).
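To make the comparison concrete, here is a minimal sketch of what a rule-based baseline for this task might look like. The patterns below are purely illustrative, covering neutral citations, traditional law-report citations, and short-title statute references; they are not the expressions evaluated in the paper.

```python
import re

# Illustrative patterns only; the paper's regex baselines are not reproduced here.
# Neutral citations, e.g. "[2005] UKHL 12" or "[2021] EWCA Civ 1308".
NEUTRAL = re.compile(r"\[(19|20)\d{2}\]\s+(UKSC|UKHL|UKPC|EWCA\s+(?:Civ|Crim)|EWHC)\s+\d+")

# Traditional law-report citations, e.g. "[1998] 2 AC 455".
REPORT = re.compile(r"\[(18|19|20)\d{2}\]\s+\d*\s*(AC|QB|KB|Ch|WLR|All\s+ER)\s+\d+")

# Short-title statute references, e.g. "Human Rights Act 1998".
STATUTE = re.compile(r"\b(?:[A-Z][\w()]*\s+)+Act\s+(18|19|20)\d{2}\b")

def find_citations(text: str) -> dict[str, list[str]]:
    """Return surface matches for each illustrative citation type."""
    return {
        "neutral": [m.group(0) for m in NEUTRAL.finditer(text)],
        "report": [m.group(0) for m in REPORT.finditer(text)],
        "statute": [m.group(0) for m in STATUTE.finditer(text)],
    }

example = ("The approach in R v A [2001] UKHL 25 was applied, "
           "citing section 3 of the Human Rights Act 1998 and "
           "Pepper v Hart [1993] AC 593.")
print(find_citations(example))
```

Even generous variants of such patterns struggle with historical and foreign citation styles, which is consistent with the large gap the paper reports between the regex baseline and the encoder models.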

2024

The Swedish Parliament Corpus 1867 – 2022
Väinö Aleksi Yrjänäinen | Fredrik Mohammadi Norén | Robert Borges | Johan Jarlbrink | Lotta Åberg Brorsson | Anders P. Olsson | Pelle Snickars | Måns Magnusson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Swedish parliamentary records are an important source material for social science and humanities researchers. We introduce a new research corpus, the Swedish Parliament Corpus, which is larger and more developed than previously available research corpora for the Swedish parliament. The corpus contains annotated and structured parliamentary records over more than 150 years, through the bicameral parliament (1867–1970) and the unicameral parliament (1971–). In addition to the records, which contain all speeches in the parliament, we also provide a database of all members of parliament over the same period. Along with the corpus, we describe procedures to ensure data quality. The corpus facilitates detailed analysis of parliamentary speeches in several research fields.
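As a hedged sketch of how annotated records of this kind might be consumed, the snippet below assumes a TEI/Parla-CLARIN-style XML layout in which each speech is a <u> (utterance) element whose who attribute links to the members-of-parliament database. The file name and the exact schema details here are assumptions made for illustration, not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Assumption: a TEI/Parla-CLARIN-style record where each speech is a <u>
# (utterance) element whose "who" attribute points at a member-of-parliament
# identifier, with text held in <seg> children. The file name is hypothetical.
TEI = "{http://www.tei-c.org/ns/1.0}"

def iter_speeches(path: str):
    """Yield (speaker_id, text) pairs from one parliamentary record."""
    tree = ET.parse(path)
    for u in tree.getroot().iter(f"{TEI}u"):
        who = u.get("who", "unknown")
        text = " ".join(seg.text.strip()
                        for seg in u.iter(f"{TEI}seg")
                        if seg.text)
        yield who, text

for speaker, speech in iter_speeches("prot-1971--01.xml"):  # hypothetical file
    print(speaker, speech[:60])
```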

2020

Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
Alexander Terenin | Måns Magnusson | Leif Jonsson
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

To scale non-parametric extensions of probabilistic topic models such as latent Dirichlet allocation to larger data sets, practitioners rely increasingly on parallel and distributed systems. In this work, we study data-parallel training for the hierarchical Dirichlet process (HDP) topic model. Based upon a representation of certain conditional distributions within an HDP, we propose a doubly sparse data-parallel sampler for the HDP topic model. This sampler utilizes all available sources of sparsity found in natural language, an important way to make computation efficient. We benchmark our method on a well-known corpus (PubMed) of 8m documents and 768m tokens, training on a single multi-core machine in under four days.
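The "doubly sparse" idea can be illustrated without the full HDP machinery. The toy Gibbs step below samples one token's topic by splitting the conditional into sparse buckets, driven by the nonzero document and word counts, and a dense smoothing bucket, so that only nonzero entries are touched in the common case. This is a generic sketch of sparsity-exploiting topic samplers, not the authors' HDP algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_topic_sparse(doc_counts, word_counts, topic_totals, alpha, beta, V):
    """Toy sparse Gibbs step for one token. The LDA-style conditional
    p(z = k) ∝ (n_dk + alpha) * (n_wk + beta) / (n_k + V * beta)
    splits into a dense smoothing bucket (the alpha * beta term) and sparse
    buckets that are nonzero only where n_dk or n_wk is nonzero."""
    denom = topic_totals + V * beta
    smooth = alpha * beta / denom              # dense, but precomputable
    nz = np.union1d(np.flatnonzero(doc_counts), np.flatnonzero(word_counts))
    sparse = ((doc_counts[nz] + alpha) * (word_counts[nz] + beta) / denom[nz]
              - smooth[nz])                    # mass beyond the smoothing bucket
    u = rng.uniform(0.0, smooth.sum() + sparse.sum())
    if u < sparse.sum():                       # usual case: walk nonzero topics only
        return int(nz[np.searchsorted(np.cumsum(sparse), u)])
    u -= sparse.sum()                          # rare case: dense smoothing bucket
    return int(np.searchsorted(np.cumsum(smooth), u))

# Made-up counts: a document touching two topics, a word type touching one.
K, V = 100, 50_000
doc_counts = np.zeros(K); doc_counts[[3, 7]] = [5, 2]
word_counts = np.zeros(K); word_counts[7] = 40
topic_totals = np.full(K, 1_000.0)
print(sample_topic_sparse(doc_counts, word_counts, topic_totals, 0.1, 0.01, V))
```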

2019

Interpretable Word Embeddings via Informative Priors
Miriam Hurtado Bodell | Martin Arvidsson | Måns Magnusson
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Word embeddings have demonstrated strong performance on NLP tasks. However, their lack of interpretability and their unsupervised nature have limited their use within computational social science and digital humanities. We propose the use of informative priors to create interpretable and domain-informed dimensions for probabilistic word embeddings. Experimental results show that sensible priors can capture latent semantic concepts better than or on par with the current state of the art, while retaining the simplicity and generalizability of using priors.
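As a toy illustration of the general mechanism, and not the paper's model, one can seed a single embedding dimension with an informative Gaussian prior mean for a handful of hand-picked words, so that the dimension acquires a human-readable interpretation. The seed words and the "sentiment" axis below are made-up examples.

```python
import numpy as np

# Toy informative prior for a probabilistic embedding: seed words get a
# Gaussian prior whose mean pushes one designated dimension toward +1 or -1,
# so that dimension acquires a readable meaning (here, a hypothetical
# "sentiment" axis). Vocabulary and seeds are illustrative.
D = 50                     # embedding dimensionality
AXIS = 0                   # the dimension we want to be interpretable
seeds = {"good": +1.0, "excellent": +1.0, "bad": -1.0, "awful": -1.0}

def prior_mean(word: str) -> np.ndarray:
    """Prior mean of N(mu, sigma^2 I): zero everywhere except the seeded axis."""
    mu = np.zeros(D)
    if word in seeds:
        mu[AXIS] = seeds[word]
    return mu

def log_prior(word: str, vec: np.ndarray, sigma: float = 1.0) -> float:
    """Gaussian log-density (up to a constant), usable as the prior term in
    MAP or variational training of the embedding."""
    diff = vec - prior_mean(word)
    return -0.5 * float(diff @ diff) / sigma**2

# During training, this prior term is added to the data log-likelihood:
# non-seed words keep a standard zero-mean prior, seeds are pulled to the axis.
```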

2017

Pulling Out the Stops: Rethinking Stopword Removal for Topic Models
Alexandra Schofield | Måns Magnusson | David Mimno
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

It is often assumed that topic models benefit from the use of a manually curated stopword list. Constructing this list is time-consuming and often subject to user judgments about what kinds of words are important to the model and the application. Although stopword removal clearly affects which word types appear as most probable terms in topics, we argue that this improvement is superficial, and that topic inference benefits little from the practice of removing stopwords beyond very frequent terms. Removing corpus-specific stopwords after model inference is more transparent and produces similar results to removing those words prior to inference.
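A hedged sketch of the practice the abstract recommends, removing corpus-specific stopwords after inference rather than before: train a topic model on unfiltered text, then filter stopwords only when reporting top terms. The tiny corpus, the stopword list, and the use of gensim here are illustrative choices, not the paper's experimental setup.

```python
from gensim import corpora, models

# Toy corpus; in practice these would be full documents.
texts = [doc.lower().split() for doc in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a topic model infers the themes of the corpus",
    "the corpus contains many documents about the model",
]]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Train with stopwords left in...
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)

# ...then filter corpus-specific stopwords *after* inference, when
# displaying topics. The stopword list is illustrative.
stopwords = {"the", "a", "on", "of"}

def top_terms(model, topic_id, topn=5):
    """Top terms of a topic with post-hoc stopword filtering."""
    terms = model.show_topic(topic_id, topn=topn + len(stopwords))
    return [(w, p) for w, p in terms if w not in stopwords][:topn]

for k in range(2):
    print(k, top_terms(lda, k))
```

Because the filtering happens only at display time, the fitted model itself is untouched, which is what makes the post-hoc approach more transparent than curating a stopword list before inference.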