Ivan Porupski


2026

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages – Croatian, Czech, Polish, and Serbian – with a total size of more than 6,000 hours. The corpora were built automatically from the ParlaMint transcripts and their accompanying metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each corpus has been significantly enriched with several automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions; similarly, the spoken modality has been automatically annotated with occurrences of filled pauses, the most frequent type of disfluency in typical speech. Two of the languages have additionally been enriched with detailed word- and grapheme-level alignments and with automatic annotation of the position of primary stress in multisyllabic words. These enrichments greatly increase the usefulness of the corpora for downstream research across multiple disciplines, which we showcase through an analysis of the acoustic correlates of sentiment. All corpora are available for download in JSONL and TextGrid formats, as well as for search through a concordancer.
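Both distribution formats can be inspected with standard Python tooling. The following is a minimal sketch of reading one record of each; the file names and the JSONL field names ("id", "text", "sentiment") are illustrative assumptions, not the documented ParlaSpeech schema, so adjust them to the actual release files.

```python
import json

from textgrid import TextGrid  # pip install textgrid

# JSONL: one utterance-level record per line.
with open("parlaspeech-hr.jsonl", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        record = json.loads(line)
        # "id", "text", "sentiment" are guessed field names, not the real schema.
        print(record.get("id"), record.get("text"), record.get("sentiment"))
        break  # inspect only the first record

# TextGrid: time-aligned annotation tiers (e.g. words, filled pauses) per recording.
tg = TextGrid.fromFile("recording.TextGrid")  # hypothetical filename
for tier in tg.tiers:
    print(tier.name, len(tier))
    for interval in tier:
        if interval.mark:  # skip empty (silence) intervals
            print(f"{interval.minTime:.2f}-{interval.maxTime:.2f}: {interval.mark}")
```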

2025

Filled pauses are among the most common paralinguistic features of speech, yet they are typically omitted from transcripts. We propose a transformer-based approach for detecting filled pauses directly from the speech signal, fine-tuned on Slovenian and evaluated across South and West Slavic languages. Our results show that speech transformers achieve excellent performance in detecting filled pauses when evaluated in an in-language scenario. We further evaluate the cross-lingual capabilities of the model on two closely related South Slavic languages (Croatian and Serbian) and two less closely related West Slavic languages (Czech and Polish). The results reveal strong cross-lingual generalization, with only minor performance drops. Moreover, error analysis shows that the model outperforms human annotators in recall and F1 score, while trailing slightly in precision. In addition to evaluating the capabilities of speech transformers for filled pause detection across Slavic languages, we release new multilingual test datasets and make our fine-tuned model publicly available to support further research and applications in spoken language processing.
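One plausible way to run such a detector is as frame-level audio classification with a wav2vec 2.0-style model from the transformers library. The sketch below is an assumption about the setup, not the authors' released pipeline: the model identifier is a placeholder, and treating label 1 as the filled-pause class is a guess about the classification head.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification

MODEL_ID = "some-org/filled-pause-detector"  # placeholder, not the released checkpoint

extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForAudioFrameClassification.from_pretrained(MODEL_ID)
model.eval()

# Load the audio and resample it to the extractor's expected rate (16 kHz).
waveform, sr = torchaudio.load("speech.wav")  # hypothetical file
waveform = torchaudio.functional.resample(waveform, sr, extractor.sampling_rate)

inputs = extractor(
    waveform.squeeze(0),
    sampling_rate=extractor.sampling_rate,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, frames, num_labels)

# Each output frame covers roughly 20 ms of audio; per-frame argmax gives a
# binary speech vs. filled-pause decision under the assumed two-class head.
frame_preds = logits.argmax(dim=-1).squeeze(0)
print("frames predicted as filled pause:", int((frame_preds == 1).sum()))
```

Consecutive positive frames would then be merged into time-stamped filled-pause spans, which is also the form in which such annotations could be stored in TextGrid tiers.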