Örvar Kárason
2023
Is Part-of-Speech Tagging a Solved Problem for Icelandic?
Örvar Kárason
|
Hrafn Loftsson
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models that obtained the highest accuracy for Icelandic when they were introduced. The fourth model is of a type that currently reaches state-of-the-art accuracy. We use the most recent version of the MIM-GOLD training/testing corpus, its newest tagset, and augmentation data to obtain results that are comparable between the various models. We examine the accuracy improvements with each model and analyse the errors produced by our transformer model, which is based on a previously published ConvBERT model. For the set of errors that all the models make, and for which they predict the same tag, we extract a random subset for manual inspection. Extrapolating from this subset, we obtain a lower bound estimate on annotation errors in the corpus as well as on some unsolvable tagging errors. We argue that further tagging accuracy gains for Icelandic can still be obtained by fixing the errors in MIM-GOLD and, furthermore, that it should still be possible to squeeze out some small gains from our transformer model.
2019
Augmenting a BiLSTM Tagger with a Morphological Lexicon and a Lexical Category Identification Step
Steinþór Steingrímsson
|
Örvar Kárason
|
Hrafn Loftsson
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Previous work on using BiLSTM models for PoS tagging has primarily focused on small tagsets. We evaluate BiLSTM models for tagging Icelandic, a morphologically rich language, using a relatively large tagset. Our baseline BiLSTM model achieves higher accuracy than any other previously published tagger, when not taking advantage of a morphological lexicon. When we extend the model by incorporating such data, we outperform the earlier state-of-the-art results by a significant margin. We also report on work in progress that attempts to address the problem of data sparsity inherent to morphologically detailed, fine-grained tagsets. We experiment with training a separate model on only the lexical category and using the coarse-grained output tag as an input into to the main model. This method further increases the accuracy and reduces the tagging errors by 21.3% compared to previous state-of-the-art results. Finally, we train and test our tagger on a new gold standard for Icelandic.
Search