Markus Forsberg

2023

We present Superlim, a multi-task NLP benchmark and analysis platform for evaluating Swedish language models, a counterpart to the English-language (Super)GLUE suite. We describe the dataset, the tasks, the leaderboard and report the baseline results yielded by a reference implementation. The tested models do not approach ceiling performance on any of the tasks, which suggests that Superlim is truly difficult, a desirable quality for a benchmark. We address methodological challenges, such as mitigating the Anglocentric bias when creating datasets for a less-resourced language; choosing the most appropriate measures; documenting the datasets and making the leaderboard convenient and transparent. We also highlight other potential usages of the dataset, such as, for instance, the evaluation of cross-lingual transfer learning.

2021

pdf bib abs
A Data-Driven Semi-Automatic Framenet Development Methodology
Shafqat Mumtaz Virk | Dana Dannélls | Lars Borin | Markus Forsberg
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

FrameNet is a lexical semantic resource based on the linguistic theory of frame semantics. A number of framenet development strategies have been reported previously and all of them involve exploration of corpora and a fair amount of manual work. Despite previous efforts, there does not exist a well-thought-out automatic/semi-automatic methodology for frame construction. In this paper we propose a data-driven methodology for identification and semi-automatic construction of frames. As a proof of concept, we report on our initial attempts to build a wider-scale framenet for the legal domain (LawFN) using the proposed methodology. The constructed frames are stored in a lexical database and together with the annotated example sentences they have been made available through a web interface.

2020

pdf bib abs
From Linguistic Descriptions to Language Profiles
Shafqat Mumtaz Virk | Harald Hammarström | Lars Borin | Markus Forsberg | Søren Wichmann
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

Language catalogues and typological databases are two important types of resources containing different types of knowledge about the world’s natural languages. The former provide metadata such as number of speakers, location (in prose descriptions and/or GPS coordinates), language code, literacy, etc., while the latter contain information about a set of structural and functional attributes of languages. Given that both types of resources are developed and later maintained manually, there are practical limits as to the number of languages and the number of features that can be surveyed. We introduce the concept of a language profile, which is intended to be a structured representation of various types of knowledge about a natural language extracted semi-automatically from descriptive documents and stored at a central location. It has three major parts: (1) an introductory; (2) an attributive; and (3) a reference part, each containing different types of knowledge about a given natural language. As a case study, we develop and present a language profile of an example language. At this stage, a language profile is an independent entity, but in the future it is envisioned to become part of a network of language profiles connected to each other via various types of relations. Such a representation is expected to be suitable both for humans and machines to read and process for further deeper linguistic analyses and/or comparisons.

pdf bib abs
The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages
Shafqat Mumtaz Virk | Harald Hammarström | Markus Forsberg | Søren Wichmann
Proceedings of the Twelfth Language Resources and Evaluation Conference

There exist as many as 7000 natural languages in the world, and a huge number of documents describing those languages have been produced over the years. Most of those documents are in paper format. Any attempts to use modern computational techniques and tools to process those documents will require them to be digitized first. In this paper, we report a multilingual digitized version of thousands of such documents searchable through some well-established corpus infrastructures. The corpus is annotated with various meta, word, and text level attributes to make searching and analysis easier and more useful.

2016

pdf bib
Domain-specific multilingual translation for producers of information
Aarne Ranta | Kasimir Angelov | Markus Forsberg | Thomas Hallgren
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

pdf bib abs
Deriving Morphological Analyzers from Example Inflections
Markus Forsberg | Mans Hulden
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a semi-automatic method to derive morphological analyzers from a limited number of example inflections suitable for languages with alphabetic writing systems. The system we present learns the inflectional behavior of morphological paradigms from examples and converts the learned paradigms into a finite-state transducer that is able to map inflected forms of previously unseen words into lemmas and corresponding morphosyntactic descriptions. We evaluate the system when provided with inflection tables for several languages collected from the Wiktionary.

pdf bib
Learning Transducer Models for Morphological Analysis from Example Inflections
Markus Forsberg | Mans Hulden
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata

2015

pdf bib
Paradigm classification in supervised learning of morphology
Malin Ahlberg | Markus Forsberg | Mans Hulden
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
A case study on supervised classification of Swedish pseudo-coordination
Malin Ahlberg | Peter Andersson | Markus Forsberg | Nina Tahmasebi
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

2014

pdf bib
Semi-supervised learning of morphological paradigms and lexicons
Mans Hulden | Markus Forsberg | Malin Ahlberg
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

In this paper we describe and evaluate a tool for paradigm induction and lexicon extraction that has been applied to Old Swedish. The tool is semi-supervised and uses a small seed lexicon and unannotated corpora to derive full inflection tables for input lemmata. In the work presented here, the tool has been modified to deal with the rich spelling variation found in Old Swedish texts. We also present some initial experiments, which are the first steps towards creating a large-scale morphology for Old Swedish.

pdf bib
Swesaurus; or, The Frankenstein Approach to Wordnet Construction
Lars Borin | Markus Forsberg
Proceedings of the Seventh Global Wordnet Conference

2013

2012

pdf bib abs
Cloud Logic Programming for Integrating Language Technology Resources
Markus Forsberg | Torbjörn Lager
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The main goal of the CLT Cloud project is to equip lexica, morphological processors, parsers and other software components developed within CLT (Centre of Language Technology) with so called web API:s, thus making them available on the Internet in the form of web services. We present a proof-of-concept implementation of the CLT Cloud server where we use the logic programming language Prolog for composing and aggregating existing web services into new web services in a way that encourages creative exploration and rapid prototyping of LT applications.

pdf bib abs
Korp — the corpus infrastructure of Språkbanken
Lars Borin | Markus Forsberg | Johan Roxendal
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present Korp, the corpus infrastructure of Språkbanken (the Swedish Language Bank). The infrastructure consists of three main components: the Korp corpus pipeline, the Korp backend, and the Korp frontend. The Korp corpus pipeline is used for importing corpora, annotating them, and then exporting the annotated corpora into different formats. An essential feature of the pipeline is the ability to leave existing annotations untouched, both structural and word level annotations, and to use the existing annotations as the foundation of other annotations. The Korp backend consists of a set of REST-based web services for searching in and retrieving information about the corpora. Finally, the Korp frontend is a graphical search interface that interacts with the Korp backend. The interface has been inspired by corpus search interfaces such as SketchEngine, Glossa, and DeepDict, and it uses State Chart XML (SCXML) in order to enable users to bookmark interaction states. We give a functional and technical overview of the three components, followed by a discussion of planned future work.

pdf bib abs
The open lexical infrastructure of Språkbanken
Lars Borin | Markus Forsberg | Leif-Jöran Olsson | Jonatan Uppström
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present our ongoing work on Karp, Språkbanken's (the Swedish Language Bank) open lexical infrastructure, which has two main functions: (1) to support the work on creating, curating, and integrating our various lexical resources; and (2) to publish daily versions of the resources, making them searchable and downloadable. An important requirement on the lexical infrastructure is also that we maintain a strong bidirectional connection to our corpus infrastructure. At the heart of the infrastructure is the SweFN++ project with the goal to create free Swedish lexical resources geared towards language technology applications. The infrastructure currently hosts 15 Swedish lexical resources, including historical ones, some of which have been created from scratch using existing free resources, both external and in-house. The resources are integrated through links to a pivot lexical resource, SALDO, a large morphological and lexical-semantic resource for modern Swedish. SALDO has been selected as the pivot partly because of its size and quality, but also because its form and sense units have been assigned persistent identifiers (PIDs) to which the lexical information in other lexical resources and in corpora are linked.

The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, main actors in the academy, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are overviewed in this paper: linking and validating Nordic and Baltic wordnets, the harmonisation of multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources.

pdf bib
Search Result Diversification Methods to Assist Lexicographers
Lars Borin | Markus Forsberg | Karin Friberg Heppin | Richard Johansson | Annika Kjellandsson
Proceedings of the Sixth Linguistic Annotation Workshop

2011

pdf bib
Semantic search in literature as an e-Humanities research tool: CONPLISIT – Consumption patterns and life-style in 19th century Swedish literature
Lars Borin | Markus Forsberg | Christer Ahlberger
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

pdf bib abs
Diabase: Towards a Diachronic BLARK in Support of Historical Studies
Lars Borin | Markus Forsberg | Dimitrios Kokkinakis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present our ongoing work on language technology-based e-science in the humanities, social sciences and education, with a focus on text-based research in the historical sciences. An important aspect of language technology is the research infrastructure known by the acronym BLARK (Basic LAnguage Resource Kit). A BLARK as normally presented in the literature arguably reflects a modern standard language, which is topic- and genre-neutral, thus abstracting away from all kinds of language variation. We argue that this notion could fruitfully be extended along any of the three axes implicit in this characterization (the social, the topical and the temporal), in our case the temporal axis, towards a diachronic BLARK for Swedish, which can be used to develop e-science tools in support of historical studies.