Marthe Løken Midtgaard
2025
Multi-label Scandinavian Language Identification (SLIDE)
Mariia Fedorova | Jonas Sebulon Frydenberg | Victoria Handford | Victoria Ovedie Chruickshank Langø | Solveig Helene Willoch | Marthe Løken Midtgaard | Yves Scherrer | Petter Mæhlum | David Samuel
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
Identifying closely related languages at the sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present Scandinavian Language Identification and Evaluation (SLIDE): a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and we present a novel approach to training such multi-label LID models.