Hrafn Loftsson


2023

Is Part-of-Speech Tagging a Solved Problem for Icelandic?
Örvar Kárason | Hrafn Loftsson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models that obtained the highest accuracy for Icelandic when they were introduced. The fourth model is of a type that currently reaches state-of-the-art accuracy. We use the most recent version of the MIM-GOLD training/testing corpus, its newest tagset, and augmentation data to obtain results that are comparable between the various models. We examine the accuracy improvements with each model and analyse the errors produced by our transformer model, which is based on a previously published ConvBERT model. For the set of errors that all the models make, and for which they predict the same tag, we extract a random subset for manual inspection. Extrapolating from this subset, we obtain a lower bound estimate on annotation errors in the corpus as well as on some unsolvable tagging errors. We argue that further tagging accuracy gains for Icelandic can still be obtained by fixing the errors in MIM-GOLD and, furthermore, that it should still be possible to squeeze out some small gains from our transformer model.

Microservices at Your Service: Bridging the Gap between NLP Research and Industry
Tiina Lindh-Knuutila | Hrafn Loftsson | Pedro Alonso Doval | Sebastian Andersson | Bjarni Barkarson | Héctor Cerezo-Costas | Jon Gudnason | Jökull Gylfason | Jarmo Hemminki | Heiki-Jaan Kaalep
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

This paper describes a collaborative European project whose aim was to gather open source Natural Language Processing (NLP) tools and make them accessible as running services and easy to try out in the European Language Grid (ELG). The motivation of the project was to increase accessibility for more European languages and make it easier for developers to use the underlying tools in their own applications. The project resulted in the containerization of 60 existing NLP tools for 16 languages, all of which are currently running as easily testable services in the ELG platform.

Filtering Matters: Experiments in Filtering Training Sets for Machine Translation
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and if separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of different approaches, both manually and on a downstream NMT task. We find that, first, it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that while MT systems trained on data prepared using different filters do not differ substantially in quality, there is indeed a statistically significant difference. Finally, we find that the same training sets do not seem to suit different translation directions.

GameQA: Gamified Mobile App Platform for Building Multiple-Domain Question-Answering Datasets
Njall Skarphedinsson | Breki Gudmundsson | Steinar Smari | Marta Kristin Larusdottir | Hafsteinn Einarsson | Abuzar Khan | Eric Nyberg | Hrafn Loftsson
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

The methods used to create many of the well-known Question-Answering (QA) datasets are hard to replicate for low-resource languages. A commonality amongst these methods is hiring annotators to source answers from the internet by querying a single answer source, such as Wikipedia. Applying these methods for low-resource languages can be problematic since there is no single large answer source for these languages. Consequently, this can result in a high ratio of unanswered questions, since the amount of information in any single source is limited. To address this problem, we developed a novel crowd-sourcing platform to gather multiple-domain QA data for low-resource languages. Our platform, which consists of a mobile app and a web API, gamifies the data collection process. We successfully released the app for Icelandic (a low-resource language with about 350,000 native speakers) to build a dataset which rivals large QA datasets for high-resource languages both in terms of size and ratio of answered questions. We have made the platform open source with instructions on how to localize and deploy it to gather data for other low-resource languages.

2022

Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches
Steinþór Steingrímsson | Luke O’Brien | Finnur Ingimundarson | Hrafn Loftsson | Andy Way
Proceedings of Globalex Workshop on Linked Lexicography within the 13th Language Resources and Evaluation Conference

Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform a rigorous manual evaluation of four different methods: word alignments on different types of bilingual data, pivoting, machine translation and cross-lingual word embeddings. We investigate how the different setups perform using publicly available data for the English-Icelandic language pair, doing separate evaluations for each method, dataset and confidence class where it can be calculated. The results are validated by human experts, working with a random sample from all our experiments. By combining the most promising approaches and data sets, using confidence scores calculated from the data and the results of manually evaluating samples from our manual evaluation as indicators, we are able to induce lists of translations with a very high acceptance rate. We show how multiple different combinations generate lists with well over 90% acceptance rate, substantially exceeding the results for each individual approach, while still generating reasonably large candidate lists. All manually evaluated equivalence pairs are published in a new lexicon of over 232,000 pairs under an open license.

National Language Technology Platform (NLTP): overall view
Artūrs Vasiļevskis | Jānis Ziediņš | Marko Tadić | Željka Motika | Mark Fishel | Hrafn Loftsson | Jón Gu | Claudia Borg | Keith Cortis | Judie Attard | Donatienne Spiteri
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The work in progress on the CEF Action National Language Technology Platform (NLTP) is presented. The Action aims at combining the most advanced Language Technology (LT) tools and solutions in a new state-of-the-art, Artificial Intelligence (AI) driven, National Language Technology Platform (NLTP).

Pre-training and Evaluating Transformer-based Language Models for Icelandic
Jón Friðrik Daðason | Hrafn Loftsson
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we evaluate several Transformer-based language models for Icelandic on four downstream tasks: Part-of-Speech tagging, Named Entity Recognition, Dependency Parsing, and Automatic Text Summarization. We pre-train four types of monolingual ELECTRA and ConvBERT models and compare our results to a previously trained monolingual RoBERTa model and the multilingual mBERT model. We find that the Transformer models obtain better results, often by a large margin, compared to previous state-of-the-art models. Furthermore, our results indicate that pre-training larger language models results in a significant reduction in error rates in comparison to smaller models. Finally, our results show that the monolingual models for Icelandic outperform a comparably sized multilingual model.

Semi-supervised Automated Clinical Coding Using International Classification of Diseases
Hlynur Hlynsson | Steindór Ellertsson | Jon Dadason | Emil Sigurdsson | Hrafn Loftsson
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

National Language Technology Platform for Public Administration
Marko Tadić | Daša Farkaš | Matea Filko | Artūrs Vasiļevskis | Andrejs Vasiļjevs | Jānis Ziediņš | Željka Motika | Mark Fishel | Hrafn Loftsson | Jón Guðnason | Claudia Borg | Keith Cortis | Judie Attard | Donatienne Spiteri
Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference

This article presents the work in progress on the collaborative project of several European countries to develop a National Language Technology Platform (NLTP). The project aims at combining the most advanced Language Technology tools and solutions in a new, state-of-the-art, Artificial Intelligence driven, National Language Technology Platform for five EU/EEA official and lower-resourced languages.

Building an Icelandic Entity Linking Corpus
Steinunn Rut Friðriksdóttir | Valdimar Ágúst Eggertsson | Benedikt Geir Jóhannesson | Hjalti Daníelsson | Hrafn Loftsson | Hafsteinn Einarsson
Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference

In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach of using a multilingual entity linking model (mGENRE) in combination with Wikipedia API Search (WAPIS) to label our data and compare it to an approach using WAPIS only. We find that our combined method reaches 53.9% coverage on our corpus, compared to 30.9% using only WAPIS. We analyze our results and explain the value of using a multilingual system when working with Icelandic. Additionally, we analyze the data that remain unlabeled, identify patterns and discuss why they may be more difficult to annotate.

2021

Effective Bitext Extraction From Comparable Corpora Using a Combination of Three Different Approaches
Steinþór Steingrímsson | Pintu Lohar | Hrafn Loftsson | Andy Way
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Parallel sentences extracted from comparable corpora can be useful to supplement parallel corpora when training machine translation (MT) systems. This is even more prominent in low-resource scenarios, where parallel corpora are scarce. In this paper, we present a system which uses three very different measures to identify and score parallel sentences from comparable corpora. We measure the accuracy of our methods in low-resource settings by comparing the results against manually curated test data for English–Icelandic, and by evaluating an MT system trained on the concatenation of the parallel data extracted by our approach and an existing data set. We show that the system is capable of extracting useful parallel sentences with high accuracy, and that the extracted pairs substantially increase translation quality of an MT system trained on the data, as measured by automatic evaluation metrics.

CombAlign: a Tool for Obtaining High-Quality Word Alignments
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Being able to generate accurate word alignments is useful for a variety of tasks. While statistical word aligners can work well, especially when parallel training data are plentiful, multilingual embedding models have recently been shown to give good results in unsupervised scenarios. We evaluate an ensemble method for word alignment on four language pairs and demonstrate that by combining multiple tools, taking advantage of their different approaches, substantial gains can be made. This holds for settings ranging from very low-resource to high-resource. Furthermore, we introduce a new gold alignment test set for Icelandic and a new easy-to-use tool for creating manual word alignments.

IceSum: An Icelandic Text Summarization Corpus
Jón Daðason | Hrafn Loftsson | Salome Sigurðardóttir | Þorsteinn Björnsson
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Automatic Text Summarization (ATS) is the task of generating concise and fluent summaries from one or more documents. In this paper, we present IceSum, the first Icelandic corpus annotated with human-generated summaries. IceSum consists of 1,000 online news articles and their extractive summaries. We train and evaluate several neural network-based models on this dataset, comparing them against a selection of baseline methods. We find that an encoder-decoder model with a sequence-to-sequence based extractor obtains the best results, outperforming all baseline methods. Furthermore, we evaluate how the size of the training corpus affects the quality of the generated summaries. We release the corpus and the models with an open license.

2020

Effectively Aligning and Filtering Parallel Corpora under Sparse Data Conditions
Steinþór Steingrímsson | Hrafn Loftsson | Andy Way
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Parallel corpora are key to developing good machine translation systems. However, abundant parallel data are hard to come by, especially for languages with a low number of speakers. When rich morphology exacerbates the data sparsity problem, it is imperative to have accurate alignment and filtering methods that can help make the most of what is available by maximising the number of correctly translated segments in a corpus and minimising noise by removing incorrect translations and segments containing extraneous data. This paper sets out a research plan for improving alignment and filtering methods for parallel texts in low-resource settings. We propose an effective unsupervised alignment method to tackle the alignment problem. Moreover, we propose a strategy to supplement state-of-the-art models with automatically extracted information using basic NLP tools to effectively handle rich morphology.

Language Technology Programme for Icelandic 2019-2023
Anna Nikulásdóttir | Jón Guðnason | Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Steinþór Steingrímsson
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we describe a new national language technology programme for Icelandic. The programme, which spans a period of five years, aims at making Icelandic usable in communication and interactions in the digital world, by developing accessible, open-source language resources and software. The research and development work within the programme is carried out by a consortium of universities, institutions, and private companies, with a strong emphasis on cooperation between academia and industries. Five core projects will be the main content of the programme: language resources, speech recognition, speech synthesis, machine translation, and spell and grammar checking. We also describe other national language technology programmes and give an overview of the history of language technology in Iceland.

Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic
Jón Daðason | David Mollberg | Hrafn Loftsson | Kristín Bjarnadóttir
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be used to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.

2019

Augmenting a BiLSTM Tagger with a Morphological Lexicon and a Lexical Category Identification Step
Steinþór Steingrímsson | Örvar Kárason | Hrafn Loftsson
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Previous work on using BiLSTM models for PoS tagging has primarily focused on small tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language, using a relatively large tagset. Our baseline BiLSTM model achieves higher accuracy than any other previously published tagger, when not taking advantage of a morphological lexicon. When we extend the model by incorporating such data, we outperform the earlier state-of-the-art results by a significant margin. We also report on work in progress that attempts to address the problem of data sparsity inherent to morphologically detailed, fine-grained tagsets. We experiment with training a separate model on only the lexical category and using the coarse-grained output tag as an input to the main model. This method further increases the accuracy and reduces the tagging errors by 21.3% compared to previous state-of-the-art results. Finally, we train and test our tagger on a new gold standard for Icelandic.

A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
Vilhjálmur Þorsteinsson | Hulda Óladóttir | Hrafn Loftsson
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We present an open-source, wide-coverage context-free grammar (CFG) for Icelandic, and an accompanying parsing system. The grammar has over 5,600 nonterminals, 4,600 terminals and 19,000 productions in fully expanded form, with feature agreement constraints for case, gender, number and person. The parsing system consists of an enhanced Earley-based parser and a mechanism to select best-scoring parse trees from shared packed parse forests. Our parsing system is able to parse about 90% of all sentences in articles published on the main Icelandic news websites. Preliminary evaluation with evalb shows an F-measure of 70.72% on parsed sentences. Our system demonstrates that parsing a morphologically rich language using a wide-coverage CFG can be practical.

Nefnir: A high accuracy lemmatizer for Icelandic
Svanhvít Lilja Ingólfsdóttir | Hrafn Loftsson | Jón Friðrik Daðason | Kristín Bjarnadóttir
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for correctly tagged text, Nefnir obtains an accuracy of 99.55%, and for text tagged with a PoS tagger, the accuracy obtained is 96.88%.

Towards High Accuracy Named Entity Recognition for Icelandic
Svanhvít Lilja Ingólfsdóttir | Sigurjón Þorsteinsson | Hrafn Loftsson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We report on work in progress which consists of annotating an Icelandic corpus for named entities (NEs) and using it for training a named entity recognizer based on a Bidirectional Long Short-Term Memory model. Currently, we have annotated 7,538 NEs appearing in the first 200,000 tokens of a 1 million token corpus, MIM-GOLD, originally developed for serving as a gold standard for part-of-speech tagging. Our best performing model, trained on this subset of MIM-GOLD, and enriched with external word embeddings, obtains an overall F1 score of 81.3% when categorizing NEs into the following four categories: persons, locations, organizations and miscellaneous. Our preliminary results are promising, especially given the fact that 80% of MIM-GOLD has not yet been used for training.

2014

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Nicoletta Calzolari | Khalid Choukri | Thierry Declerck | Hrafn Loftsson | Bente Maegaard | Joseph Mariani | Asuncion Moreno | Jan Odijk | Stelios Piperidis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Correcting Errors in a New Gold Standard for Tagging Icelandic Text
Sigrún Helgadóttir | Hrafn Loftsson | Eiríkur Rögnvaldsson
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe the correction of PoS tags in a new Icelandic corpus, MIM-GOLD, consisting of about 1 million tokens sampled from the Tagged Icelandic Corpus, MÍM, released in 2013. The goal is to use the corpus, among other things, as a new gold standard for training and testing PoS taggers. The construction of the corpus was first described in 2010 together with preliminary work on error detection and correction. In this paper, we describe further the correction of tags in the corpus. We describe manual correction and a method for semi-automatic error detection and correction. We show that, even after manual correction, the number of tagging errors in the corpus can be reduced significantly by applying our semi-automatic detection and correction method. After the semi-automatic error correction, preliminary evaluation of tagging accuracy shows very low error rates. We hope that the existence of the corpus will make it possible to improve PoS taggers for Icelandic text.

Rapid Deployment of Phrase Structure Parsing for Related Languages: A Case Study of Insular Scandinavian
Anton Karl Ingason | Hrafn Loftsson | Eiríkur Rögnvaldsson | Einar Freyr Sigurðsson | Joel C. Wallenberg
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents ongoing work that aims to improve machine parsing of Faroese using a combination of Faroese and Icelandic training data. We show that even if we only have a relatively small parsed corpus of one language, namely 53,000 words of Faroese, we can obtain better results by adding information about phrase structure from a closely related language which has a similar syntax. Our experiment uses the Berkeley parser. We demonstrate that the addition of Icelandic data without any other modification to the experimental setup results in an F-measure improvement from 75.44% to 78.05% in Faroese and an improvement in part-of-speech tagging accuracy from 88.86% to 90.40%.

2013

Tagging the Past: Experiments using the Saga Corpus
Hrafn Loftsson
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Tagging a Morphologically Complex Language Using an Averaged Perceptron Tagger: The Case of Icelandic
Hrafn Loftsson | Robert Östling
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2011

Using a Morphological Database to Increase the Accuracy in POS Tagging
Hrafn Loftsson | Sigrún Helgadóttir | Eiríkur Rögnvaldsson
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2009

Correcting a POS-Tagged Corpus Using Three Complementary Methods
Hrafn Loftsson
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Improving the PoS tagging accuracy of Icelandic text
Hrafn Loftsson | Ida Kramarczyk | Sigrún Helgadóttir | Eiríkur Rögnvaldsson
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

Context-Sensitive Spelling Correction and Rich Morphology
Anton K. Ingason | Skúli B. Jóhannsson | Eiríkur Rögnvaldsson | Hrafn Loftsson | Sigrún Helgadóttir
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2007

Tagging Icelandic Text using a Linguistic and a Statistical Tagger
Hrafn Loftsson
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

IceParser: An Incremental Finite-State Parser for Icelandic
Hrafn Loftsson | Eiríkur Rögnvaldsson
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)