2024
Post-OCR Correction of Digitized Swedish Newspapers with ByT5
Viktoria Löfgren | Dana Dannélls
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
Many collections of digitized newspapers suffer from poor OCR quality, which impacts readability, information retrieval, and analysis of the material. Errors in OCR output can be reduced by applying machine translation models to “translate” it into a corrected version. Although transformer models show promising results in post-OCR correction and related tasks in other languages, they have not yet been explored for correcting OCR errors in Swedish texts. This paper presents a post-OCR correction model for Swedish 19th to 21st century newspapers based on the pre-trained transformer model ByT5. Three versions of the model were trained on different mixes of training data. The best model, which achieved a 36% reduction in CER, is made freely available and will be integrated into the automatic processing pipeline of Språkbanken Text, a Swedish language technology infrastructure containing modern and historical written data.
Transformer-based Swedish Semantic Role Labeling through Transfer Learning
Dana Dannélls | Richard Johansson | Lucy Yang Buhr
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Semantic Role Labeling (SRL) is a task in natural language understanding where the goal is to extract semantic roles for a given sentence. English SRL has achieved state-of-the-art performance using Transformer techniques and supervised learning. However, this technique is not a viable choice for smaller languages like Swedish due to the limited amount of training data. In this paper, we present the first effort in building a Transformer-based SRL system for Swedish by exploring multilingual and cross-lingual transfer learning methods and leveraging the Swedish FrameNet resource. We demonstrate that multilingual transfer learning outperforms two different cross-lingual transfer models. We also found some differences between frames in FrameNet that can either hinder or enhance the model’s performance. The resulting end-to-end model is freely available and will be made accessible through Språkbanken Text’s research infrastructure.
2023
Superlim: A Swedish Language Understanding Evaluation Benchmark
Aleksandrs Berdicevskis | Gerlof Bouma | Robin Kurtz | Felix Morger | Joey Öhman | Yvonne Adesam | Lars Borin | Dana Dannélls | Markus Forsberg | Tim Isbister | Anna Lindahl | Martin Malmsten | Faton Rekathati | Magnus Sahlgren | Elena Volodina | Love Börjeson | Simon Hengchen | Nina Tahmasebi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
We present Superlim, a multi-task NLP benchmark and analysis platform for evaluating Swedish language models, a counterpart to the English-language (Super)GLUE suite. We describe the dataset, the tasks, and the leaderboard, and report the baseline results yielded by a reference implementation. The tested models do not approach ceiling performance on any of the tasks, which suggests that Superlim is truly difficult, a desirable quality for a benchmark. We address methodological challenges, such as mitigating the Anglocentric bias when creating datasets for a less-resourced language; choosing the most appropriate measures; documenting the datasets and making the leaderboard convenient and transparent. We also highlight other potential uses of the dataset, such as the evaluation of cross-lingual transfer learning.
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
Nikolai Ilinykh | Felix Morger | Dana Dannélls | Simon Dobnik | Beáta Megyesi | Joakim Nivre
2021
The Swedish Winogender Dataset
Saga Hansson | Konstantinos Mavromatakis | Yvonne Adesam | Gerlof Bouma | Dana Dannélls
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
We introduce the SweWinogender test set, a diagnostic dataset to measure gender bias in coreference resolution. It is modelled after the English Winogender benchmark, and is released with reference statistics on the distribution of men and women between occupations and the association between gender and occupation in modern corpus material. The paper discusses the design and creation of the dataset, and presents a small investigation of the supplementary statistics.
OCR Processing of Swedish Historical Newspapers Using Deep Hybrid CNN–LSTM Networks
Molly Brandt Skelbye | Dana Dannélls
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Deep CNN–LSTM hybrid neural networks have proven to improve the accuracy of Optical Character Recognition (OCR) models for different languages. In this paper we examine to what extent these networks improve the OCR accuracy rates on Swedish historical newspapers. By experimenting with the open source OCR engine Calamari, we are able to show that mixed deep CNN–LSTM hybrid models outperform previous models on the task of character recognition of Swedish historical newspapers spanning 1818–1848. We achieved an average character accuracy rate (CAR) of 97.43%, which is a new state-of-the-art result on 19th century Swedish newspaper text. Our data, code and models are released under a CC-BY licence.
A Novel Machine Learning Based Approach for Post-OCR Error Detection
Shafqat Mumtaz Virk | Dana Dannélls | Azam Sheikh Muhammad
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Post-processing is the most conventional approach for correcting errors that are caused by Optical Character Recognition (OCR) systems. Two steps are usually taken to correct OCR errors: detection and correction. For the first task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel system for error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR-error detection by notable margins. We achieved state-of-the-art F1-scores for eight out of the ten involved European languages. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum for Polish, from 0.82 to 0.84.
A Data-Driven Semi-Automatic Framenet Development Methodology
Shafqat Mumtaz Virk | Dana Dannélls | Lars Borin | Markus Forsberg
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
FrameNet is a lexical semantic resource based on the linguistic theory of frame semantics. A number of framenet development strategies have been reported previously and all of them involve exploration of corpora and a fair amount of manual work. Despite previous efforts, there does not exist a well-thought-out automatic/semi-automatic methodology for frame construction. In this paper we propose a data-driven methodology for identification and semi-automatic construction of frames. As a proof of concept, we report on our initial attempts to build a wider-scale framenet for the legal domain (LawFN) using the proposed methodology. The constructed frames are stored in a lexical database and together with the annotated example sentences they have been made available through a web interface.
2020
Material Philology Meets Digital Onomastic Lexicography: The NordiCon Database of Medieval Nordic Personal Names in Continental Sources
Michelle Waldispühl | Dana Dannells | Lars Borin
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present NordiCon, a database containing medieval Nordic personal names attested in Continental sources. The database combines formally interpreted and richly interlinked onomastic data with digitized versions of the medieval manuscripts from which the data originate and information on the tokens’ context. The structure of NordiCon is inspired by other online historical given name dictionaries. It takes up challenges reported in previous work, such as how to cover material properties of a name token and how to define lemmatization principles, and elaborates on possible solutions. The lemmatization principles for NordiCon are further developed in order to facilitate the connection to other name dictionaries and corpora, and the integration of the database into Språkbanken Text, an infrastructure containing modern and historical written data.
2015
Polysemy, underspecification, and aspects – Questions of lumping or splitting in the construction of Swedish FrameNet
Karin Friberg Heppin | Dana Dannélls
Proceedings of the workshop on Semantic resources and semantic annotation for Natural Language Processing and the Digital Humanities at NODALIDA 2015
Formalising the Swedish Constructicon in Grammatical Framework
Normunds Gruzitis | Dana Dannélls | Benjamin Lyngfelt | Aarne Ranta
Proceedings of the Grammar Engineering Across Frameworks (GEAF) 2015 Workshop
2014
Extracting a bilingual semantic grammar from FrameNet-annotated corpora
Dana Dannélls | Normunds Gruzitis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present the creation of an English-Swedish FrameNet-based grammar in Grammatical Framework. The aim of this research is to make existing framenets computationally accessible for multilingual natural language applications via a common semantic grammar API, and to facilitate the porting of such a grammar to other languages. In this paper, we describe the abstract syntax of the semantic grammar while focusing on its automatic extraction possibilities. We have extracted a shared abstract syntax from ~58,500 annotated sentences in Berkeley FrameNet (BFN) and ~3,500 annotated sentences in Swedish FrameNet (SweFN). The abstract syntax defines 769 frame-specific valence patterns that cover 77.8% of the examples in BFN and 74.9% in SweFN belonging to the shared set of 471 frames. As a side result, we provide a unified method for comparing semantic and syntactic valence patterns across framenets.
Using language technology resources and tools to construct Swedish FrameNet
Dana Dannélls | Karin Friberg Heppin | Anna Ehrlemark
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing
2013
Multilingual access to cultural heritage content on the Semantic Web
Dana Dannélls | Aarne Ranta | Ramona Enache | Mariana Damova | Maria Mateva
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
2012
Toward Language Independent Methodology for Generating Artwork Descriptions – Exploring FrameNet Information
Dana Dannélls | Lars Borin
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
On generating coherent multilingual descriptions of museum objects from Semantic Web ontologies
Dana Dannélls
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference
2011
A Framework for Improved Access to Museum Databases in the Semantic Web
Dana Dannélls | Mariana Damova | Ramona Enache | Milen Chechev
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage
2010
Applying Semantic Frame Theory to Automate Natural Language Template Generation From Ontology Statements
Dana Dannélls
Proceedings of the 6th International Natural Language Generation Conference
2006
Recognizing Acronyms and their Definitions in Swedish Medical Texts
Dimitrios Kokkinakis | Dana Dannélls
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper addresses the task of recognizing acronym-definition pairs in Swedish (medical) texts, as well as the compilation of a freely available sample of such manually annotated pairs. The material is suitable not only for supervised learning experiments but also as a testbed for evaluating the quality of future acronym-definition recognition systems. A number of approaches to the identification are described in the literature, particularly within the biomedical domain, but none of them addresses the variation and complexity exhibited in a language other than English. This complexity arises because two languages, i.e. Swedish and English, can be mixed in the same document and/or sentence; because Swedish is a compounding language, which significantly deteriorates the performance of previous approaches (without adaptations); and, most importantly, because there is a large variation of possible acronym-definition permutations in the analysed corpora, a variation that is usually ignored in previous studies.
Automatic Acronym Recognition
Dana Dannélls
Demonstrations