Darinka Verdonik


2026

Slovenian is a less-resourced South Slavic language. Existing Slovenian spoken language resources mainly cover the standard language in everyday communication. However, Slovenian encompasses a wide range of dialects, most of which are not represented in available spoken language resources. This paper presents the development of Zila, a Slovenian spoken language resource for the Gail Valley dialect. This dialect is one of the most endangered varieties of Slovenian and is spoken in the extreme north-western periphery of the Slovenian language area. The goal of the project was to build a language resource comprising 100 hours of speech with manually produced transcriptions. The spoken material was collected from members of the Slovenian minority in Carinthia, Austria, with the local community playing a key role in the data acquisition process. A dedicated set of transcription rules was created to capture the full range of acoustic and linguistic features of the Gail Valley dialect, which differs significantly from standard Slovenian. A preliminary speech recognition experiment was conducted to analyze these differences further. The Zila project demonstrates how spoken language technologies can help to preserve the cultural and linguistic heritage of an endangered dialect.
We present ROG, the first manually annotated spoken corpus of Slovenian to integrate morphosyntactic, prosodic, and interactional layers in a unified framework. Building on the pre-existing Spoken Slovenian Treebank (SST) and newly available recordings from the GOS 2 reference corpus, the resource combines over 75,000 words (10 hours) of annotated speech. The entire corpus features lemmatization, MULTEXT-East morphosyntax, and Universal Dependencies annotations, while approximately half includes additional layers for prosodic units, disfluencies, and dialogue acts. All annotation layers are systematically aligned and cross-referenced, enabling detailed multi-dimensional analyses of spoken language. We describe the corpus design, annotation workflow, data release, and baseline modeling results, showcasing the resource’s value for both linguistic analysis and speech-aware NLP model development. All ROG transcriptions and annotations, along with half of the audio recordings, are freely available under CC-BY via (anonymized) repository.

2024

This paper introduces a new version of the Gos reference corpus of spoken Slovenian, which was recently extended to more than double the original size (300 hours, 2.4 million words) by adding speech recordings and transcriptions from two related initiatives, the Gos VideoLectures corpus of public academic speech, and the Artur speech recognition database. We describe this process by first presenting the criteria guiding the balanced selection of the newly added data and the challenges encountered when merging language resources with divergent designs, followed by the presentation of other major enhancements of the new Gos corpus, such as improvements in lemmatization and morphosyntactic annotation, word-level speech alignment, a new XML schema and the development of a specialized online concordancer.

2016

This paper presents a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotation and transcription of the acquired spoken material were generated automatically, applying acoustic segmentation and automatic speech recognition. The development and evaluation subset was also manually transcribed using the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k were distinct.
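For readers unfamiliar with the two evaluation measures named in this abstract, the following is a minimal sketch of how word error rate and out-of-vocabulary rate are commonly computed. It assumes whitespace tokenization and the standard word-level edit distance; the abstract does not say which scoring tool or text normalization was used, so the function names and the toy input are purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions)
    divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (rolling single-row DP).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Share of running tokens that are missing from the recogniser vocabulary."""
    if not tokens:
        raise ValueError("token list must not be empty")
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Toy input (hypothetical, not taken from the corpus).
wer = word_error_rate("a b c d", "a x c")
oov = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
print(f"WER = {wer:.2f}, OOV rate = {oov:.2f}")
```

In the toy input one word is substituted and one is deleted against a four-word reference, so the error count is two out of four reference words. One of the four tokens is absent from the vocabulary.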

2014

The aim of the paper is to search for common guidelines for the future development of speech databases for less-resourced languages, in order to make them as useful as possible for both main fields of their use: linguistic research and speech technologies. We compare two standards for creating speech databases, one followed when developing the Slovene speech database for automatic speech recognition, BNSI Broadcast News, and the other followed when developing the Slovene reference speech corpus GOS, and outline possible common guidelines for future work. We also present an add-on for the GOS corpus that enables its use for automatic speech recognition.

2006

This paper presents the SINOD database, which is the first Slovenian non-native speech database. It will be used to improve the performance of a large-vocabulary continuous speech recogniser for non-native speakers. The main quality impact is expected for the acoustic models and the recogniser’s vocabulary. The SINOD database is designed as a supplement to the Slovenian BNSI Broadcast News database. The same BN recommendations were used for both databases. Two interviews with non-native Slovenian speakers were incorporated in the set. Both non-native speakers were female, whereas the journalist was a male native speaker of Slovenian. The transcription approach applied in the production phase is presented. Different statistics and analyses of the database are given in the paper.
The paper presents the Turdis database of spontaneous conversations in the tourism domain in Slovenian. The database was built for developing speech-to-speech translation components, but it can also be used for developing dialog systems or for linguistic research. The idea was to record a database of telephone conversations in tourism in which the naturalness of the conversations is affected as little as possible, while still obtaining permission for recording from all speakers. Recording in a studio environment can cause many problems. It is especially difficult to imitate a tourist agent if a speaker has no such experience and therefore lacks the background knowledge that a tourist agent has. The Turdis database was therefore recorded with professional tourist agents. An agreement with local tourist companies allowed us to record tourist agents at their workplace, during working hours, while they answered the telephone. Callers were contacted individually and asked to use the Turdis system to call a selected tourist company. Technically, the recording was done using a PC ISDN card. The database was orthographically transcribed with the Transcriber tool. At present it includes approximately 43,000 words.

2004

2002