Gerlof Bouma

2023

We present Superlim, a multi-task NLP benchmark and analysis platform for evaluating Swedish language models, a counterpart to the English-language (Super)GLUE suite. We describe the dataset, the tasks, and the leaderboard, and report the baseline results yielded by a reference implementation. The tested models do not approach ceiling performance on any of the tasks, which suggests that Superlim is truly difficult, a desirable quality for a benchmark. We address methodological challenges, such as mitigating Anglocentric bias when creating datasets for a less-resourced language, choosing the most appropriate measures, documenting the datasets, and making the leaderboard convenient and transparent. We also highlight other potential uses of the dataset, such as the evaluation of cross-lingual transfer learning.

DaLAJ-GED - a dataset for Grammatical Error Detection tasks on Swedish
Elena Volodina | Yousuf Ali Mohammed | Aleksandrs Berdicevskis | Gerlof Bouma | Joey Öhman
Proceedings of the 12th Workshop on NLP for Computer Assisted Language Learning

2021

The Swedish Winogender Dataset
Saga Hansson | Konstantinos Mavromatakis | Yvonne Adesam | Gerlof Bouma | Dana Dannélls
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We introduce the SweWinogender test set, a diagnostic dataset to measure gender bias in coreference resolution. It is modelled after the English Winogender benchmark, and is released with reference statistics on the distribution of men and women between occupations and the association between gender and occupation in modern corpus material. The paper discusses the design and creation of the dataset, and presents a small investigation of the supplementary statistics.

2020

The EDGeS Diachronic Bible Corpus
Gerlof Bouma | Evie Coussé | Trude Dijkstra | Nicoline van der Sijs
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the EDGeS Diachronic Bible Corpus: a diachronically and synchronically parallel corpus of Bible translations in Dutch, English, German and Swedish, with texts from the 14th century until today. It is compiled in the context of an intended longitudinal and contrastive study of complex verb constructions in Germanic. The paper discusses the corpus design principles, its selection of 36 Bibles, and the information and metadata encoded for the corpus texts. The EDGeS corpus will be available in two forms: the whole corpus will be accessible for researchers behind a login in the well-known OPUS search infrastructure, and the open subpart of the corpus will be available for download.

2017

Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
Gerlof Bouma | Yvonne Adesam
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

2016

Old Swedish Part-of-Speech Tagging between Variation and External Knowledge
Yvonne Adesam | Gerlof Bouma
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

A Multi-domain Corpus of Swedish Word Sense Annotation
Richard Johansson | Yvonne Adesam | Gerlof Bouma | Karin Hedberg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe the word sense annotation layer in Eukalyptus, a freely available five-domain corpus of contemporary Swedish with several annotation layers. The annotation uses the SALDO lexicon to define the sense inventory, and allows word sense annotation of compound segments and multiword units. We give an overview of the new annotation tool developed for this project, and finally present an analysis of the inter-annotator agreement between two annotators.

2015

Defining the Eukalyptus forest – the Koala treebank of Swedish
Yvonne Adesam | Gerlof Bouma | Richard Johansson
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

In this paper we describe and evaluate a tool for paradigm induction and lexicon extraction that has been applied to Old Swedish. The tool is semi-supervised and uses a small seed lexicon and unannotated corpora to derive full inflection tables for input lemmata. In the work presented here, the tool has been modified to deal with the rich spelling variation found in Old Swedish texts. We also present some initial experiments, which are the first steps towards creating a large-scale morphology for Old Swedish.

2012

A Best-First Anagram Hashing Filter for Approximate String Matching with Generalized Edit Distance
Malin Ahlberg | Gerlof Bouma
Proceedings of COLING 2012: Posters

2010

Collocation Extraction beyond the Independence Assumption
Gerlof Bouma
Proceedings of the ACL 2010 Conference Short Papers

Syntactic Tree Queries in Prolog
Gerlof Bouma
Proceedings of the Fourth Linguistic Annotation Workshop

Towards a Large Parallel Corpus of Cleft Constructions
Gerlof Bouma | Lilja Øvrelid | Jonas Kuhn
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present our efforts to create a large-scale, semi-automatically annotated parallel corpus of cleft constructions. The corpus is intended to reduce, or make more effective, the manual work of finding examples of clefts in a corpus. The corpus is being developed in the context of the Collaborative Research Centre SFB 632, a large, interdisciplinary research initiative to study information structure at the University of Potsdam and the Humboldt University in Berlin. The corpus is based on the Europarl corpus (version 3). We show how state-of-the-art NLP tools, like POS taggers and statistical dependency parsers, may facilitate powerful and precise searches. We argue that identifying clefts using automatically added syntactic structure annotation is ultimately to be preferred over using lower-level, though more robust, extraction methods like regular expression matching. An evaluation of the extraction method for one of the languages also offers some support for this method. We end the paper by discussing the resulting corpus itself. We present some examples of interesting clefts and translational counterparts from the corpus and suggest ways of exploiting our newly created resource in the cross-linguistic study of clefts.