Per E. Kummervold
Also published as: Per E Kummervold
2026
Building a One-Million-Pair Bokmål–Nynorsk Translation Corpus: A Quality-First Harvesting and Cleaning Pipeline
Per E. Kummervold | Thea Tollersrud | Angelina Zanardi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present a high-quality parallel corpus for translation between Norwegian Bokmål (nb) and Nynorsk (nn), two closely related written standards of Norwegian. The corpus was assembled from two complementary sources: Nasjonal digital læringsarena (NDLA), an educational platform, and Nynorsk pressekontor (NPK), a newswire service. Our methodology prioritizes precision over volume, employing a multi-stage filtering pipeline designed to address the specific challenges of aligning near-neighbor languages. This pipeline combines paragraph-level alignment, deduplication, multilingual semantic similarity scoring, language identification confidence checks, structural consistency tests, and strict bidirectional adjudication by a Large Language Model (LLM). To address the common problem of untranslated or placeholder "pending" copies, we apply a rule that flags pairs with zero semantic distance when the Nynorsk side shows weak evidence of being distinctively Nynorsk. After filtering, we retained 191,695 pairs from NDLA and 809,164 pairs from NPK, resulting in a merged corpus of 1,000,859 parallel paragraphs. This resource demonstrates that a precision-oriented pipeline can produce data better suited for training robust machine translation systems and instruction-tuned models than larger but noisier alternatives.
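The multi-stage filtering the abstract describes (semantic similarity scoring, language identification confidence checks, structural consistency tests, and the zero-distance "pending copy" rule) can be illustrated with a minimal sketch. The thresholds, the `embed` and `lid_confidence` functions, and the exact ordering of checks below are assumptions for illustration, not the authors' actual implementation:

```python
# Hypothetical sketch of a pair-filtering rule in the spirit of the
# abstract. `embed` (multilingual sentence embedding) and
# `lid_confidence` (language-ID probability) are assumed callables;
# all thresholds are illustrative.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def keep_pair(nb_text, nn_text, embed, lid_confidence,
              sim_threshold=0.85, lid_threshold=0.7):
    """Return True if a Bokmål–Nynorsk pair passes the sketched filters."""
    # 1. "Pending copy" rule: identical sides (zero semantic distance)
    #    are rejected unless the Nynorsk side is confidently Nynorsk.
    if nb_text == nn_text and lid_confidence(nn_text, "nn") < lid_threshold:
        return False
    # 2. Semantic similarity on multilingual embeddings.
    if cosine(embed(nb_text), embed(nn_text)) < sim_threshold:
        return False
    # 3. Structural consistency: the two sides should be of
    #    comparable length for closely related written standards.
    ratio = len(nn_text) / max(len(nb_text), 1)
    if not 0.5 <= ratio <= 2.0:
        return False
    return True
```

In a real pipeline the deduplication and LLM adjudication stages described in the paper would run before and after such a filter; this sketch only shows how the per-pair checks could compose.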
2023
A Manual Evaluation Method of Neural MT for Indigenous Languages
Linda Wiechetek | Flammie A. Pirinen | Per E Kummervold
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Indigenous language expertise is not encoded in written text in the same way as it is for languages with a long literary tradition; in many cases it is, on the contrary, conserved mostly orally. Evaluating neural MT systems solely with algorithms that learn from written text is therefore inadequate for measuring the quality of a system as used by the language community. Extensive use of tools trained on large amounts of non-native language can even contribute to language change in ways the community does not desire, and can pollute the internet with automatically generated texts that outweigh native ones. We propose a manual evaluation method that assesses flow and content separately, and we additionally use existing rule-based NLP to evaluate other factors such as spelling, grammar, and grammatical richness. Our main conclusion is that the language expertise of a native speaker is necessary to properly evaluate a given system. We test the method by manually evaluating two neural MT tools for an indigenous low-resource language, presenting an experiment on two different neural translation systems to and from North Sámi, an indigenous language of Northern Europe.
2021
Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model
Per E Kummervold | Javier De la Rosa | Freddy Wetjen | Svein Arne Brygfjeld
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.