Dávid í Lág
2026
FPSC: A Sustainable Pipeline for Building a Faroese Parliamentary Speech Corpus
Dávid í Lág | Barbara Scalvini | Carlos Daniel Hernandez Mena | Jon Gudnason
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This work addresses the lack of large-scale, natural speech data for Faroese automatic speech recognition. Existing resources, such as the 100-hour Ravnursson corpus, consist of read speech and do not capture the spontaneous variation, sociolinguistic aspects, and prosody of real dialogue, limiting model performance. To overcome this, we present the Faroese Parliament Speech Corpus (FPSC), a 1,600-hour collection of parliamentary recordings comprising 89,000 speeches with detailed speaker and linguistic metadata. The corpus includes weakly supervised transcriptions generated by an ensemble of four Faroese-adapted ASR models combined through a ROVER-based voting procedure. In creating FPSC, we trained several new state-of-the-art ASR models for Faroese, some built on large-scale pretrained backbones and others leveraging multilingual transfer, all of which outperform previously published Faroese ASR systems. FPSC represents the first corpus of natural spoken Faroese and a major step toward realistic ASR modeling for the language, offering an open, reproducible, and scalable resource for future speech and language research.
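The ROVER-based combination mentioned in the abstract merges multiple ASR hypotheses by word-level voting. A minimal sketch of the voting stage is below; note that the real ROVER procedure (from NIST's SCTK) first aligns the hypotheses into a word transition network via dynamic programming, whereas this illustration assumes the hypotheses are already aligned slot-by-slot, with "*" marking an empty slot.

```python
from collections import Counter

def majority_vote(hypotheses):
    """Combine pre-aligned, equal-length ASR hypotheses by per-slot
    majority vote. A simplified stand-in for ROVER's voting stage:
    the alignment step of real ROVER is assumed to have been done,
    and '*' denotes a deletion (empty slot) in a hypothesis.
    """
    combined = []
    for slot in zip(*hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != "*":  # a winning empty slot emits nothing
            combined.append(word)
    return " ".join(combined)
```

With four systems voting, a transcription error made by a single model is outvoted as long as the other three agree, which is the intuition behind using an ensemble to produce weakly supervised transcripts.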
2025
Mapping Faroese in the Multilingual Representation Space: Insights for ASR Model Optimization
Dávid í Lág | Barbara Scalvini | Jon Gudnason
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
ASR development for low-resource languages like Faroese faces significant challenges due to the scarcity of large, diverse datasets. While fine-tuning multilingual models using related languages is a common practice, there is no standardized method for selecting these auxiliary languages, leading to a computationally expensive trial-and-error process. By analyzing where Faroese sits relative to other languages in wav2vec2's multilingual representation space, we find that its closest neighbors are shaped not only by linguistic similarity but also by historical, phonetic, and cultural factors. These findings open new avenues for auxiliary language selection to improve Faroese ASR and underscore the potential value of data-driven factors in ASR fine-tuning.
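One common way to operationalize "positioning in representation space" is to mean-pool a model's hidden states per language and rank candidate auxiliary languages by cosine similarity to the target. The sketch below illustrates that ranking step only; the vectors are placeholders, not actual wav2vec2 outputs, and the paper's own analysis may use a different pooling or distance measure.

```python
import numpy as np

def rank_auxiliary_languages(target_vec, candidates, k=3):
    """Rank candidate languages by cosine similarity of their
    (hypothetical) mean-pooled speech-encoder embeddings to the
    target language's embedding. Higher similarity = closer in
    the multilingual representation space.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(target_vec, item[1]),
        reverse=True,
    )
    return [code for code, _vec in ranked[:k]]
```

A data-driven shortlist like this could replace trial-and-error fine-tuning runs: only the top-ranked neighbors need to be tried as auxiliary languages.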
Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0
Carlos Daniel Hernández Mena | Barbara Scalvini | Dávid í Lág
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
Mozilla Common Voice is a crowdsourced project that aims to create a public, multilingual dataset of voice recordings for training speech recognition models. In Common Voice, anyone can contribute by donating or validating recordings in various languages. However, despite the availability of many recordings in certain languages, a significant percentage remains unvalidated by users. This is the case for Spanish: in version 17.0 of Common Voice, 75% of the 2,220 hours of recordings are unvalidated. In this work, we used the Whisper recognizer to automatically validate approximately 784 hours of recordings, exceeding the 562 hours validated by users. To verify the accuracy of the validation, we developed a speech recognition model based on a version of NVIDIA NeMo's Parakeet, which does not have an official Spanish version. Our final model achieved a WER of less than 4% on the test and validation splits of Common Voice 17.0. Both the model and the speech corpus are publicly available on Hugging Face.
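The reported "WER of less than 4%" is the standard word error rate: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of that metric, computed with the usual Levenshtein dynamic program over words (the paper itself likely used an off-the-shelf implementation such as NeMo's):

```python
def wer(reference, hypothesis):
    """Word error rate = (substitutions + deletions + insertions)
    / number of reference words, via word-level edit distance.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub_cost,  # match / substitution
            )
    return d[-1][-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as a rate rather than an accuracy.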