Marie Iversdatter Røsok
2025
Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway
Tita Enstad | Trond Trosterud | Marie Iversdatter Røsok | Yngvil Beyer | Marie Roald
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
Optical Character Recognition (OCR) is crucial to the National Library of Norway’s (NLN) digitisation process as it converts scanned documents into machine-readable text. However, for the Sámi documents in NLN’s collection, the OCR accuracy is insufficient. Given that OCR quality affects downstream processes, evaluating and improving OCR for text written in Sámi languages is necessary to make these resources accessible. To address this need, this work fine-tunes and evaluates three established OCR approaches, Transkribus, Tesseract and TrOCR, for transcribing Sámi texts from NLN’s collection. Our results show that Transkribus and TrOCR outperform Tesseract on this task, while Tesseract achieves superior performance on an out-of-domain dataset. Furthermore, we show that fine-tuning pre-trained models and supplementing manual annotations with machine annotations and synthetic text images can yield accurate OCR for Sámi languages, even with a moderate amount of manually annotated data.
2024
NB Uttale: A Norwegian Pronunciation Lexicon with Dialect Variation
Marie Iversdatter Røsok | Ingerid Løyning Dale
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present a Norwegian pronunciation lexicon with Bokmål orthographic word forms and up to eight alternate phonological transcriptions per word form. The lexicon covers dialectal variations for five geographical areas, as well as pronunciation variations for spontaneous and manuscript-read speech. It is based on the NST Bokmål lexicon for East Norwegian, whose original phonological transcriptions were corrected before being converted with dialect-specific regular expression rules. To evaluate the quality and consistency of the new, rule-generated transcriptions, we trained grapheme-to-phoneme (G2P) models and report our results with word-error-rate (WER) and phoneme-error-rate (PER) metrics. We found that the G2P models trained on lexica for Southwest and West Norwegian close-to-written transcriptions have the lowest WER scores, and that all error-corrected, close-to-written lexica yield better WER scores than the original NST lexicon. The lexicon is available under an open license, and can be used for various language technology applications and in linguistic research.
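The abstract does not spell out how WER and PER are computed. The sketch below shows one common convention for G2P evaluation, which is an assumption here and not necessarily the authors' exact procedure: WER is the share of words whose predicted transcription differs from the reference at all, and PER is the phoneme-level edit distance divided by the total number of reference phonemes. The function names `edit_distance` and `g2p_error_rates` and the toy phoneme lists are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def g2p_error_rates(refs: List[List[str]],
                    hyps: List[List[str]]) -> Tuple[float, float]:
    """Return (WER, PER) under the assumed definitions described above."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and equally long")
    # A word counts as an error if its transcription is not an exact match.
    word_errors = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    phoneme_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phonemes = sum(len(r) for r in refs)
    return word_errors / len(refs), phoneme_errors / total_phonemes


# Toy example with made-up phoneme sequences (not taken from the lexicon).
refs = [["h", "u", "s"], ["b", "i", "l"]]
hyps = [["h", "u", "s"], ["b", "e", "l"]]
wer, per = g2p_error_rates(refs, hyps)
```

Under these definitions a single wrong phoneme makes the whole word count as an error, so WER is typically higher than PER for the same model output.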