Per E. Kummervold

Also published as: Per E Kummervold


2026

We present a high-quality parallel corpus for translation between Norwegian Bokmål (nb) and Nynorsk (nn), two closely related written standards of Norwegian. The corpus was assembled from two complementary sources: Nasjonal digital læringsarena (NDLA), an educational platform, and Nynorsk pressekontor (NPK), a newswire service. Our methodology prioritizes precision over volume, employing a multi-stage filtering pipeline designed to address the specific challenges of aligning near-neighbor languages. This pipeline combines paragraph-level alignment, deduplication, multilingual semantic similarity scoring, language identification confidence checks, structural consistency tests, and strict bidirectional adjudication by a Large Language Model (LLM). To address the common problem of untranslated or placeholder "pending" copies, we apply a rule that flags pairs with zero semantic distance when the Nynorsk side shows weak evidence of being distinctively Nynorsk. After filtering, we retained 191,695 pairs from NDLA and 809,164 pairs from NPK, resulting in a merged corpus of 1,000,859 parallel paragraphs. This resource demonstrates that a precision-oriented pipeline can produce data better suited for training robust machine translation systems and instruction-tuned models than larger but noisier alternatives.
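The "pending copy" rule above can be sketched as a small filter. This is an illustrative assumption, not the authors' actual implementation: the marker word list, the similarity threshold, and the function name `is_pending_copy` are all hypothetical, chosen only to show the shape of the heuristic (near-identical pair plus no distinctively Nynorsk vocabulary on the Nynorsk side).

```python
# Hypothetical sketch of the "untranslated pending copy" filter.
# Marker list and threshold are illustrative assumptions.

# Function words that occur in Nynorsk but not in Bokmål.
NYNORSK_MARKERS = {"ikkje", "eit", "korleis", "noko", "kva"}

def is_pending_copy(similarity: float, nn_text: str,
                    sim_threshold: float = 0.999) -> bool:
    """Flag a pair whose sides are (near-)identical and whose
    supposed Nynorsk side shows no distinctively Nynorsk vocabulary."""
    if similarity < sim_threshold:  # only zero-semantic-distance pairs
        return False
    tokens = {t.strip(".,!?;:\"'()").lower() for t in nn_text.split()}
    return not (tokens & NYNORSK_MARKERS)
```

A pair that passes the similarity check but contains markers such as "ikkje" would be kept; an identical pair with none of them would be flagged as a likely untranslated copy.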

2023

Indigenous language expertise is not encoded in written text in the same way as it is for languages with a long literary tradition; in many cases it is, on the contrary, preserved mostly orally. Evaluating neural MT systems solely with algorithms that learn from written text is therefore inadequate for measuring the quality a system delivers to the language community. Extensive use of tools trained on large amounts of non-native language can even contribute to language change in ways the community does not desire, and can pollute the internet with automatically generated texts that outweigh native ones. We propose a manual evaluation method that assesses flow and content separately, and we additionally use existing rule-based NLP to evaluate other factors such as spelling, grammar, and grammatical richness. Our main conclusion is that the language expertise of a native speaker is necessary to properly evaluate a given system. We test the method in an experiment manually evaluating two neural MT systems translating to and from North Sámi, an indigenous language of Northern Europe.

2021

In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves on mBERT's performance for other languages present in the corpus, such as English, Swedish, and Danish. For languages not included in the corpus, performance degrades moderately while the model retains strong multilingual properties. We thereby show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.