Taja Kuzman Pungeršek

2026

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian
Nikola Ljubešić | Peter Rupnik | Ivan Porupski | Taja Kuzman Pungeršek
Proceedings of the Fifteenth Language Resources and Evaluation Conference

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages – Croatian, Czech, Polish, and Serbian – with a total size of more than 6,000 hours. The corpora were built automatically from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each parliament. In this release of the dataset, each of the corpora has been significantly enriched with several automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent type of disfluency in typical speech. Two languages have been additionally enriched with detailed word- and grapheme-level alignments, and with automatic annotation of the position of primary stress in multisyllabic words. These enrichments greatly increase the usefulness of the corpora for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are available for download in JSONL and TextGrid formats, as well as for search through a concordancer.


The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Taja Kuzman Pungeršek | Peter Rupnik | Vit Suchomel | Nikola Ljubešić
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Crawling national top-level domains has proven to be highly effective for collecting texts in less-resourced languages. This approach has recently been used for South Slavic languages and resulted in the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this crawling infrastructure, the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains: a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.
